clustering

Researching, benchmarking and optimising clustering algorithms for GPU integration.

Overview

This repository contains a comprehensive implementation and evaluation framework for clustering algorithms, with a focus on GPU-accelerated HDBSCAN. The project compares performance between our custom GPU-enabled HDBSCAN implementation, scikit-learn's HDBSCAN, and DBSCAN across various datasets and parameters.

Repository Structure

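clustering/
├── gpu_hdbscan/                 # GPU-accelerated HDBSCAN implementation
│   ├── boruvka/                 # Borůvka's algorithm implementation
│   ├── kd_tree/                 # KD-tree data structure implementation
│   ├── single_linkage/          # Single linkage clustering components
│   ├── main.cpp                 # Main driver for GPU HDBSCAN
│   └── Makefile                 # Build configuration
├── sk_learn_hdbscan/            # Python port of scikit-learn's HDBSCAN
│   ├── utils/                   # Parameter validation helpers
│   ├── sk_hdbscan.py            # Wrapper exposing the HDBSCAN function
│   └── recreation.py            # Functions used by HDBSCAN
├── utils/                       # Utility functions and evaluation tools
│   ├── eval.py                  # Evaluation metrics and analysis
│   └── plot.py                  # Visualization and plotting functions
├── benchmark_integrated.py      # Comprehensive benchmarking suite
├── eda.ipynb                    # Notebook for visualising the simulated data
├── findEps.py                   # DBSCAN parameter optimization
├── requirements.txt             # Required Python packages
└── README.md                    # This file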

Components

GPU HDBSCAN Implementation (`gpu_hdbscan/`)

Our custom GPU-accelerated implementation of the HDBSCAN clustering algorithm, organized into modular components:

boruvka/: Implementation of Borůvka's minimum spanning tree algorithm for efficient cluster hierarchy construction
kd_tree/: KD-tree data structure implementation for fast nearest neighbor searches
single_linkage/: Single linkage clustering components used in the hierarchical clustering process
main.cpp: Main driver program that coordinates all components
Makefile: Build system that compiles all C++ files across directories and produces the gpu_hdbscan executable in the build/ directory
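
To make the role of the `boruvka/` component more concrete, here is a minimal CPU-side sketch of Borůvka's algorithm in Python. It is an illustration only and not the code in `gpu_hdbscan/`: the function name `boruvka_mst`, the edge-list input and the union-find helper are all assumptions chosen for readability. In each round, every component picks its lightest outgoing edge, and those edges are then merged into the tree.

```python
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); hypothetical input format


def boruvka_mst(num_vertices: int, edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning forest using Borůvka's algorithm."""
    parent = list(range(num_vertices))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst: List[Edge] = []
    components = num_vertices
    while components > 1:
        # cheapest[c] holds the index of the lightest edge leaving component c.
        cheapest = {}
        for i, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # Edge lies inside one component.
            for r in (ru, rv):
                # Ties are broken by edge index so all components use one ordering.
                if r not in cheapest or (w, i) < (edges[cheapest[r]][2], cheapest[r]):
                    cheapest[r] = i
        if not cheapest:
            break  # Graph is disconnected; nothing left to merge.
        for i in set(cheapest.values()):
            u, v, w = edges[i]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
                components -= 1
    return mst
```

Because every round at least halves the number of components, the outer loop runs O(log V) times, and the per-component edge selection within a round is independent work, which is what makes the algorithm a natural fit for parallel hardware.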

scikit-learn HDBSCAN Port (`sk_learn_hdbscan/`)

A pure-Python port of scikit-learn's Cython HDBSCAN code. It lets us inspect the output of each stage of the algorithm and compare it with the output of our own implementation, for more comprehensive analysis.

utils/: Contains _param_validation.py, which validates the data types used by the other Python scripts in this folder
sk_hdbscan.py: Overall wrapper which provides access to HDBSCAN function
recreation.py: Implementation of functions used by HDBSCAN
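
When comparing the final labels of two implementations, the cluster IDs themselves need not agree: one run may call a cluster `0` and another may call it `5`. The sketch below checks whether two labelings describe the same partition up to such a renaming. The helper name `labels_equivalent` is hypothetical, and the sketch assumes noise points are marked with `-1`, as scikit-learn does.

```python
from typing import Sequence


def labels_equivalent(a: Sequence[int], b: Sequence[int], noise: int = -1) -> bool:
    """True if labelings a and b describe the same partition up to renaming cluster IDs."""
    if len(a) != len(b):
        return False
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # Noise must map to noise; it is not a cluster that can be renamed.
        if (x == noise) != (y == noise):
            return False
        # Each ID in a must map to exactly one ID in b, and vice versa.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True
```

For example, `labels_equivalent([0, 0, 1, -1], [5, 5, 2, -1])` holds, while `[0, 0, 1]` and `[0, 1, 1]` are not equivalent because they split the points differently.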

Utilities (`utils/`)

`eval.py`

Evaluation framework providing:

Clustering quality metrics
Performance measurement tools
Statistical analysis functions
Comparative evaluation between different algorithms
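
As one example of a clustering quality metric suited to comparing algorithms, the adjusted Rand index (ARI) scores the agreement between two labelings and is unaffected by how the clusters are numbered. The following is a standalone sketch and not the contents of `eval.py`; the function name is an assumption. Note that it treats the noise label `-1` as just another cluster.

```python
from collections import Counter
from math import comb
from typing import Sequence


def adjusted_rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    # Pairs of points placed together in both labelings (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total if total else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one cluster, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1 means the two partitions are identical, values near 0 are what random labelings would give, and negative values indicate less agreement than chance.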

`plot.py`

Visualization toolkit for:

Performance comparison charts
Clustering result visualizations
Parameter sensitivity analysis plots
Comprehensive reporting graphics

Benchmarking and Evaluation

`benchmark_integrated.py`

The main benchmarking script that performs comprehensive performance comparisons between:

Our GPU-accelerated HDBSCAN
Scikit-learn's HDBSCAN implementation
Scikit-learn's DBSCAN implementation

Features:

Batched data processing for efficient memory usage
Integration with evaluation and plotting utilities
Generates detailed performance metrics and visual comparisons
Outputs results in both CSV format and comprehensive plots
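
The batching, timing and CSV steps can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code in `benchmark_integrated.py`: the function names, the `cluster_fn` callback and the CSV column names are all hypothetical.

```python
import csv
import io
import time
from typing import Callable, Iterator, List, Sequence

Point = Sequence[float]


def iter_batches(points: Sequence[Point], batch_size: int) -> Iterator[Sequence[Point]]:
    """Yield consecutive slices of at most batch_size points."""
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]


def benchmark(name: str, cluster_fn: Callable[[Sequence[Point]], List[int]],
              points: Sequence[Point], batch_size: int) -> List[dict]:
    """Time cluster_fn on each batch and return one result row per batch."""
    rows = []
    for index, batch in enumerate(iter_batches(points, batch_size)):
        start = time.perf_counter()
        labels = cluster_fn(batch)
        elapsed = time.perf_counter() - start
        rows.append({
            "algorithm": name,
            "batch": index,
            "n_points": len(batch),
            "n_clusters": len(set(labels) - {-1}),  # -1 marks noise, scikit-learn style
            "seconds": elapsed,
        })
    return rows


def rows_to_csv(rows: List[dict]) -> str:
    """Serialise result rows to CSV text; returns an empty string for no rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing each algorithm as a callable with the same signature keeps the timing loop identical for all three implementations, so the measured differences come from the algorithms rather than from the harness.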

`findEps.py`

Parameter optimization utility for DBSCAN that determines optimal values for:

eps (epsilon): Maximum distance between two samples for them to be considered neighbors
min_samples: Minimum number of samples in a neighborhood for a point to be considered a core point
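
One common heuristic for choosing eps is the k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those distances, and take the value at the "elbow" of the curve. The sketch below illustrates that heuristic only; whether `findEps.py` uses this method is an assumption, and the function names are hypothetical. Since scikit-learn's min_samples counts the point itself, k = min_samples - 1 is a usual choice.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def k_distances(points: Sequence[Point], k: int) -> List[float]:
    """Sorted distance from each point to its k-th nearest neighbour (brute force)."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def elbow_eps(sorted_kdist: Sequence[float]) -> float:
    """Pick the value lying farthest below the straight line joining the curve's endpoints."""
    n = len(sorted_kdist)
    if n < 3:
        return sorted_kdist[-1]
    first, last = sorted_kdist[0], sorted_kdist[-1]
    best_index, best_gap = 0, -1.0
    for i, d in enumerate(sorted_kdist):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - d
        if gap > best_gap:
            best_index, best_gap = i, gap
    return sorted_kdist[best_index]
```

The brute-force neighbour search is O(n²) and is only meant for small samples; on larger datasets a spatial index such as a KD-tree would replace the inner loop.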

Data Analysis and Visualisation

`eda.ipynb`

Jupyter notebook containing the code we used to visualise the simulated data.

Building and Running

Prerequisites

C++ compiler with GPU support (HIP-capable)
Python 3.x
Python packages listed in requirements.txt (scikit-learn, numpy, pandas, matplotlib)
CUDA toolkit (for GPU acceleration)

Building GPU HDBSCAN

cd gpu_hdbscan
make

This creates a build/ directory containing the gpu_hdbscan executable.

Running Benchmarks

# Run comprehensive benchmarking
python benchmark_integrated.py

# Find optimal DBSCAN parameters
python findEps.py

Features

GPU Acceleration: Leverages GPU computing for significant performance improvements in large-scale clustering tasks
Comprehensive Benchmarking: Systematic comparison across multiple algorithms and datasets
Modular Design: Clean separation of concerns with reusable components
Visualization Tools: Rich plotting capabilities for result analysis and presentation
Parameter Optimization: Automated parameter tuning for optimal clustering performance

Research Focus

This project focuses on:

Optimizing clustering algorithms for GPU architectures
Comparative analysis of clustering algorithm performance
Scalability improvements for large-scale datasets
Development of efficient hierarchical clustering methods

Output

The benchmarking suite generates:

Performance comparison CSV files
Visual plots showing algorithm comparisons
Statistical analysis of clustering quality
Execution time and memory usage metrics
