Comparative Analysis of Clustering Algorithms: K-Means, DBSCAN, and GMM

📄 Abstract

This repository contains the implementation of our SRC (Student Research Conference, Zayed University) paper on clustering algorithm benchmarking. We present a systematic comparison of three widely used clustering paradigms — K-Means (centroid-based), DBSCAN (density-based), and Gaussian Mixture Models (GMM) (model-based) — across nine benchmark datasets with varying sizes, dimensionalities, and cluster complexities.

All algorithms are evaluated under a unified preprocessing and evaluation pipeline including duplicate removal, outlier filtering, feature standardization, and PCA-based visualization. Performance is assessed using purity, accuracy, macro-precision, macro-recall, macro F1-score, and silhouette score. Results provide a practical decision-making roadmap for clustering algorithm selection based on empirical data characteristics.

📂 Repository Structure

├── data/
│   └── README.md                    # Dataset descriptions and sources
├── notebooks/
│   ├── kmeans_clustering.ipynb      # K-Means implementation and evaluation
│   ├── dbscan_clustering.ipynb      # DBSCAN implementation and evaluation
│   └── gmm_clustering.ipynb         # GMM implementation and evaluation
├── paper/
│   └── Clustering_Comparative_Analysis.pdf   # Published conference paper
├── LICENSE
├── README.md
└── requirements.txt

This is the official code for the SRC paper "Comparative Analysis of Clustering Algorithms: K-Means, DBSCAN, and Gaussian Mixture Model".
Authors: Peter Yacoub, Mohamed Malek Kaouach, Soud Asaad Soud Alhazba, Mohammad Azmi Al-Betar (Ajman University, UAE)
Paper (PDF): paper/Clustering_Comparative_Analysis.pdf

⚙️ Methodology

Preprocessing: Duplicate removal, Z-score outlier filtering (|z| > 3), feature standardization
Algorithms: K-Means (RBF), DBSCAN (ε, MinPts tuned via k-distance curve), GMM (EM algorithm)
Datasets: 9 benchmark datasets — Banknote, Ionosphere, Sonar, Blobs, Varied, Wine, Flame, Glass, Iris
Evaluation: Purity, Accuracy, Macro-Precision, Macro-Recall, Macro F1-score, Silhouette Score
Visualization: PCA-based 2D scatter plots for qualitative cluster structure comparison
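
The preprocessing steps listed above (duplicate removal, Z-score filtering at |z| > 3, standardization) could be sketched roughly as follows. This is a minimal NumPy illustration, not the code from the notebooks: the function name `preprocess`, the `z_thresh` parameter, and the choice to recompute the mean and standard deviation after filtering are all assumptions made for the example.

```python
import numpy as np


def preprocess(X, z_thresh=3.0):
    """Drop duplicate rows, filter |z| > z_thresh outliers, then standardize.

    Returns the standardized rows and the original indices of the rows kept,
    so ground-truth labels can be aligned later for evaluation.
    Hypothetical helper; the notebooks may implement these steps differently.
    """
    X = np.asarray(X, dtype=float)

    # 1. Duplicate removal: keep the first occurrence of each distinct row,
    #    restoring the original row order (np.unique sorts its output).
    _, first_idx = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first_idx)
    X = X[keep]

    # 2. Z-score outlier filtering: drop rows where any feature has |z| > z_thresh.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    z = (X - X.mean(axis=0)) / std
    inliers = (np.abs(z) <= z_thresh).all(axis=1)
    X, keep = X[inliers], keep[inliers]

    # 3. Standardization: zero mean and unit variance on the filtered rows
    #    (assumption: statistics are recomputed after filtering).
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std, keep
```

Returning `keep` alongside the data means labels can be subset with `y[keep]` before computing the external metrics.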

📊 Results Summary

DBSCAN

Dataset	Purity	Accuracy	F1	Silhouette
Blobs	100.00%	100.00%	1.000	0.81
Flame	100.00%	100.00%	1.000	0.31
Wine	97.46%	97.46%	0.977	0.63
Iris	94.90%	94.90%	0.943	0.67
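
The Methodology section notes that DBSCAN's ε and MinPts were tuned via a k-distance curve. As a rough illustration of that technique (not the notebook code), the sketch below computes each point's distance to its k-th nearest neighbor and sorts the values. The function name `k_distance_curve` and the default `k=4` are assumptions; the relation between k and MinPts used in the paper is not stated here.

```python
import numpy as np


def k_distance_curve(X, k=4):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending.

    Hypothetical helper. Plotting the result and reading off the "knee"
    gives a candidate value for DBSCAN's eps.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; acceptable for small datasets).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)
```

Points in dense regions sit on the flat part of the curve, while noise points produce the sharp rise at the end, so ε is usually picked near the bend.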

K-Means

Dataset	Purity	Accuracy	F1	Silhouette
Varied	93.10%	93.10%	0.929	0.63
Wine	94.00%	93.80%	0.933	0.28
Flame	90.90%	67.30%	0.772	0.40

GMM

Dataset	Purity	Accuracy	F1	Silhouette
Blobs	100.00%	100.00%	1.000	0.81
Varied	98.87%	98.87%	0.989	0.60
Ionosphere	72.57%	72.57%	0.621	0.22

DBSCAN achieved the strongest overall robustness on non-convex and noisy datasets. K-Means performed best on compact, well-separated clusters. GMM excelled on elliptical and overlapping structures.
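
For readers who want to see how the purity and accuracy columns relate, here is a minimal pure-Python sketch. Purity follows its standard definition. The accuracy function assumes clusters are matched to classes one-to-one before scoring; that mapping is an assumption for illustration and may differ from what the notebooks do. Both function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def purity(y_true, y_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(y_true, y_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(y_true)


def matched_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class mapping.

    Assumption: one-to-one matching, found by brute force over permutations,
    which is only practical for a handful of clusters. A DBSCAN noise label
    such as -1 is treated as just another cluster here.
    """
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    overlap = Counter(zip(y_pred, y_true))
    # Pad the shorter side with None so every permutation is a valid mapping;
    # pairs involving None never occur in `overlap` and count as zero.
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    clusters += [None] * (size - len(clusters))
    best = max(
        sum(overlap[(c, k)] for c, k in zip(clusters, perm))
        for perm in permutations(classes)
    )
    return best / len(y_true)
```

Under these definitions, splitting one true class across two clusters leaves purity at 1.0 but lowers matched accuracy, because only one of the two clusters can be assigned to that class. For larger numbers of clusters, `scipy.optimize.linear_sum_assignment` solves the same matching problem without enumerating permutations.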

🗃️ Datasets

Nine labeled benchmark datasets were used. Clustering itself was run fully unsupervised; the ground-truth labels were used only to score the resulting clusters:

Dataset	N	Dimensions	Classes
Banknote	1372	4	2
Ionosphere	351	34	2
Sonar	208	60	2
Blobs	1500	2	3
Varied	1500	2	3
Wine	178	13	3
Flame	240	2	2
Glass	214	9	6
Iris	150	4	3

⚠️ Datasets are not included in this repository. See data/README.md for download links.

🚀 Installation

git clone https://github.com/Soudk21/clustering-algorithms-benchmark.git
cd clustering-algorithms-benchmark
pip install -r requirements.txt

📓 Usage

Each notebook is self-contained. Run them independently:

jupyter notebook notebooks/kmeans_clustering.ipynb
jupyter notebook notebooks/dbscan_clustering.ipynb
jupyter notebook notebooks/gmm_clustering.ipynb

🤝 Acknowledgments

Affiliated with: Artificial Intelligence Research Center (AIRC), Ajman University, UAE
Submitted to: Student Research Conference (SRC), Zayed University

📜 Citation

If you use this code or findings in your research, please cite:

@inproceedings{yacoub2025clustering,
title={Comparative Analysis of Clustering Algorithms: K-Means, DBSCAN, and Gaussian Mixture Model},
author={Yacoub, Peter and Kaouach, Mohamed Malek and Alhazba, Soud Asaad Soud and Al-Betar, Mohammad Azmi},
booktitle={Proceedings of the Student Research Conference (SRC), Zayed University},
year={2025}
}

License

MIT License. See LICENSE for details.
