Research, benchmarking and optimising clustering algorithms for GPU integration.
This repository contains a comprehensive implementation and evaluation framework for clustering algorithms, with a focus on GPU-accelerated HDBSCAN. The project compares performance between our custom GPU-enabled HDBSCAN implementation, scikit-learn's HDBSCAN, and DBSCAN across various datasets and parameters.
clustering/
├── gpu_hdbscan/ # GPU-accelerated HDBSCAN implementation
│ ├── boruvka/ # Borůvka's algorithm implementation
│ ├── kd_tree/ # KD-tree data structure implementation
│ ├── single_linkage/ # Single linkage clustering components
│ ├── main.cpp # Main driver for GPU HDBSCAN
│ └── Makefile # Build configuration
├── utils/ # Utility functions and evaluation tools
│ ├── eval.py # Evaluation metrics and analysis
│ └── plot.py # Visualization and plotting functions
├── benchmark_integrated.py # Comprehensive benchmarking suite
├── findEps.py # DBSCAN parameter optimization
└── README.md # This file
Our custom GPU-accelerated implementation of the HDBSCAN clustering algorithm, organized into modular components:
- boruvka/: Implementation of Borůvka's minimum spanning tree algorithm for efficient cluster hierarchy construction
- kd_tree/: KD-tree data structure implementation for fast nearest neighbor searches
- single_linkage/: Single linkage clustering components used in the hierarchical clustering process
- main.cpp: Main driver program that coordinates all components
- Makefile: Build system that compiles all C++ files across directories and produces the gpu_hdbscan executable in the build/ directory
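To illustrate the idea behind the boruvka/ component, here is a minimal CPU-only sketch of Borůvka's algorithm in Python. It is not the repository's C++/GPU code; the function name `boruvka_mst` and the edge-list input format are assumptions made for this example. In HDBSCAN the edge weights would typically be mutual reachability distances, but the sketch works on any weighted, undirected graph.

```python
def boruvka_mst(num_vertices, edges):
    """Return the minimum spanning tree (or forest) as a list of edges.

    edges: list of (weight, u, v) tuples with 0 <= u, v < num_vertices.
    Hypothetical helper for illustration only.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    num_components = num_vertices
    while num_components > 1:
        # For every component, record the lightest edge that leaves it.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge is internal to one component
            for root in (ru, rv):
                # Comparing whole tuples breaks weight ties consistently.
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            break  # graph is disconnected: return a spanning forest
        # Merge components along the selected edges.
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((w, u, v))
                num_components -= 1
    return mst


example_edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3), (5, 1, 3)]
example_mst = boruvka_mst(4, example_edges)
```

Each round lets every component pick its cheapest outgoing edge independently, which is the property that makes Borůvka's algorithm a natural fit for parallel hardware.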
Our attempt at porting scikit-learn's Cython HDBSCAN code to pure Python, so that we can debug it at each step and compare its output with that of our implementation for a more comprehensive analysis.
- utils/: Contains _param_validation.py, which validates the data types used in the other Python scripts within the folder
- sk_hdbscan.py: Overall wrapper which provides access to the HDBSCAN function
- recreation.py: Implementation of the functions used by HDBSCAN
Evaluation framework providing:
- Clustering quality metrics
- Performance measurement tools
- Statistical analysis functions
- Comparative evaluation between different algorithms
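The README does not state which quality metrics eval.py computes. As one common choice for comparing two labelings, the following sketch computes the adjusted Rand index using only the standard library. The function name `adjusted_rand_index` is an assumption for this example, and noise points (label -1 in HDBSCAN and DBSCAN output) are simply treated as one more cluster here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same points.

    Requires at least two points. Hypothetical helper for illustration.
    """
    n = len(labels_a)
    # Contingency counts: how many points share each (label_a, label_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every point in its own cluster
    return (sum_pairs - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1.0 means the two labelings describe the same partition regardless of label names, which makes the metric convenient for comparing the GPU implementation against scikit-learn's output.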
Visualization toolkit for:
- Performance comparison charts
- Clustering result visualizations
- Parameter sensitivity analysis plots
- Comprehensive reporting graphics
The main benchmarking script that performs comprehensive performance comparisons between:
- Our GPU-accelerated HDBSCAN
- Scikit-learn's HDBSCAN implementation
- Scikit-learn's DBSCAN implementation
Features:
- Batched data processing for efficient memory usage
- Integration with evaluation and plotting utilities
- Generates detailed performance metrics and visual comparisons
- Outputs results in both CSV format and comprehensive plots
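The batching and CSV output described above can be sketched roughly as follows. This is not the code in benchmark_integrated.py: the helper names `iter_batches` and `benchmark`, the CSV column names, and the idea of passing each algorithm as a plain callable are all assumptions made for illustration.

```python
import csv
import io
import time


def iter_batches(data, batch_size):
    """Yield consecutive slices so only one batch is processed at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def benchmark(algorithms, data, batch_size, out):
    """Time each algorithm on each batch and write one CSV row per run.

    algorithms: dict mapping a display name to a callable taking one batch.
    out: any writable text stream (a file opened with newline="" or StringIO).
    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["algorithm", "batch", "n_points", "seconds"])
    rows = 0
    for name, fit in algorithms.items():
        for batch_idx, batch in enumerate(iter_batches(data, batch_size)):
            start = time.perf_counter()
            fit(batch)
            elapsed = time.perf_counter() - start
            writer.writerow([name, batch_idx, len(batch), f"{elapsed:.6f}"])
            rows += 1
    return rows


# Usage with stand-in "algorithms"; real runs would wrap the clustering calls.
buffer = io.StringIO()
points = [(float(i), float(i)) for i in range(10)]
stand_ins = {"sorted": sorted, "length": len}
rows_written = benchmark(stand_ins, points, batch_size=4, out=buffer)
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters when individual batches finish quickly.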
Parameter optimization utility for DBSCAN that determines optimal values for:
- eps (epsilon): Maximum distance between two samples for them to be considered neighbors
- min_samples: Minimum number of samples in a neighborhood for a point to be considered a core point
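The README does not describe how findEps.py chooses these values. A common heuristic for eps is the k-distance plot: sort every point's distance to its k-th nearest neighbour and take the value at the "knee" of the curve. The sketch below shows that heuristic with a brute-force neighbour search; the names `k_distances` and `knee_eps` and the simple knee rule are assumptions for this example, not the script's actual method.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour.

    Brute force, O(n^2 log n); fine for a sketch, too slow for large data.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_eps(sorted_kdist):
    """Return the k-distance lying furthest below the chord joining the
    first and last points of the sorted curve (needs at least two values)."""
    n = len(sorted_kdist)
    y0, y1 = sorted_kdist[0], sorted_kdist[-1]
    best_i, best_gap = 0, -1.0
    for i, y in enumerate(sorted_kdist):
        chord = y0 + (y1 - y0) * i / (n - 1)
        gap = chord - y  # a convex curve lies below its chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return sorted_kdist[best_i]


# Four points on a unit square plus one far-away outlier.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kdist = k_distances(sample, k=2)
eps_guess = knee_eps(kdist)
```

In this heuristic, k usually plays the role of min_samples, so the two parameters are tuned together rather than independently.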
The Jupyter notebook containing the code we used to visualise the simulated data.
- C++ compiler with GPU support (HIP-capable)
- Python 3.x
- Required Python packages: scikit-learn, numpy, pandas and matplotlib, as listed in requirements.txt
- CUDA toolkit (for GPU acceleration)
cd gpu_hdbscan
make

This creates a build/ directory containing the gpu_hdbscan executable.
# Run comprehensive benchmarking
python benchmark_integrated.py
# Find optimal DBSCAN parameters
python findEps.py

- GPU Acceleration: Leverages GPU computing for significant performance improvements in large-scale clustering tasks
- Comprehensive Benchmarking: Systematic comparison across multiple algorithms and datasets
- Modular Design: Clean separation of concerns with reusable components
- Visualization Tools: Rich plotting capabilities for result analysis and presentation
- Parameter Optimization: Automated parameter tuning for optimal clustering performance
This project focuses on:
- Optimizing clustering algorithms for GPU architectures
- Comparative analysis of clustering algorithm performance
- Scalability improvements for large-scale datasets
- Development of efficient hierarchical clustering methods
The benchmarking suite generates:
- Performance comparison CSV files
- Visual plots showing algorithm comparisons
- Statistical analysis of clustering quality
- Execution time and memory usage metrics