aaakash06/GMM

GMM Simulator with Robust EM Algorithm

An interactive Streamlit application for Gaussian Mixture Model clustering with a robust Expectation-Maximization implementation. Upload your CSV data, preprocess it, configure the EM algorithm, and explore comprehensive visualizations of clustering results.

Features

Data Preprocessing

  • Automatic detection and removal of non-numeric/ID columns
  • Multiple missing value strategies (drop rows, fill with mean/median)
  • Outlier handling using IQR method (clip or remove)
  • Feature selection and standardization (StandardScaler)
  • Real-time preprocessing report
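
As a rough illustration of the IQR outlier options, the sketch below computes the usual fences and then either clips or removes values outside them. The 1.5 multiplier and the helper names (`iqr_bounds`, `clip_outliers`, `remove_outliers`) are assumptions for illustration; the app's own preprocessing code may differ.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles.

    k=1.5 is the conventional multiplier and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """'Clip' mode: pull values outside the fences back to the nearest fence."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]

def remove_outliers(values, k=1.5):
    """'Remove' mode: drop values outside the fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

balances = [1, 2, 3, 4, 100]          # toy column with one extreme value
clipped = clip_outliers(balances)     # 100 is pulled down to the upper fence
kept = remove_outliers(balances)      # 100 is dropped
```

Clipping keeps the row count unchanged, which can matter when other columns in the same row are still informative; removal shrinks the dataset instead.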

Robust EM Algorithm

  • K-means++ initialization for intelligent starting points
  • Log-sum-exp trick for numerical stability
  • Covariance regularization to prevent singular matrices
  • Multiple random restarts to avoid local optima
  • Full convergence tracking with log-likelihood history
  • BIC and AIC scoring for model selection

Visualization Dashboard

  • Data Overview Tab: Data preview, statistical summary, feature distributions, correlation heatmap
  • Clusters Tab: 2D/3D PCA scatter plots, cluster size distribution, per-cluster feature violin plots
  • Convergence Tab: Log-likelihood convergence curve, per-iteration improvement chart
  • Model Selection Tab: BIC/AIC comparison across K values, automatic best-K detection
  • Parameters Tab: Component-wise parameters (means, std devs, covariances), responsibility heatmap

Installation

Prerequisites

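  • Python 3.9 or higher (the minimum package versions listed under Requirements may call for a newer interpreter; check each package's supported Python versions)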

Setup

  1. Clone the repository and navigate to the project directory:
cd /path/to/GMM
  2. Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Requirements

streamlit>=1.56.0
numpy>=2.4.4
pandas>=3.0.2
scipy>=1.17.1
scikit-learn>=1.6.1
plotly>=6.7.0

Usage

Starting the Application

streamlit run main.py

By default, the app opens in your browser at http://localhost:8501.

Workflow

  1. Upload Data: Click "Browse files" to upload a CSV file
  2. Preprocess: Configure data cleaning in the sidebar
    • Handle missing values (drop/mean/median)
    • Enable/disable standardization
    • Auto-drop ID/text columns
    • Handle outliers (clip/remove)
    • Select specific features (optional)
  3. Configure EM Algorithm:
    • Set number of components (K)
    • Adjust max iterations and tolerance
    • Set covariance regularization
    • Choose number of random restarts
  4. Run Simulation: Click the "Run Simulation" button
  5. Explore Results: Navigate through tabs to analyze results

Algorithm Details

Robust EM Implementation

The Expectation-Maximization algorithm implemented includes several robustness features:

1. K-means++ Initialization

Instead of random initialization, we use K-means++ which:

  • Selects initial centers that are spread apart
  • Reduces likelihood of poor local optima
  • Improves convergence speed
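
A minimal sketch of K-means++ seeding, assuming plain Python tuples as points. The function name `kmeans_pp_init` is hypothetical and this is not taken from gmm_engine.py; it only illustrates how each new center is sampled with probability proportional to its squared distance from the nearest center already chosen.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers from `points` using K-means++ seeding."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(rng.choice(points))
        else:
            # Far-away points are proportionally more likely to be chosen.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = kmeans_pp_init(points, k=3, rng=random.Random(0))
```

Points that are already centers have zero weight, so they are not picked again while other candidates remain.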

2. Numerical Stability

  • Log-sum-exp trick: Prevents underflow when computing responsibilities
  • Cholesky decomposition: Stable computation of log Gaussian densities
  • Covariance regularization: Adds small diagonal terms to prevent singularity
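
To illustrate the log-sum-exp bullet, here is a small sketch for a one-dimensional, two-component mixture. The helper names are assumptions, and the real engine works with multivariate densities; the point is only that normalizing in log space avoids the 0/0 that the naive computation produces for a point far from every component.

```python
import math

def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_gaussian_1d(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior component probabilities and log p(x)."""
    log_joint = [
        math.log(w) + log_gaussian_1d(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    log_norm = log_sum_exp(log_joint)
    return [math.exp(lj - log_norm) for lj in log_joint], log_norm

# A point far from both components: every naive density underflows to 0.0.
x = 1000.0
naive = [0.5 * math.exp(log_gaussian_1d(x, m, 1.0)) for m in (0.0, 1.0)]
resp, log_px = responsibilities(x, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
```

Here `sum(naive)` is exactly zero, so dividing by it would fail, while `resp` still sums to one and `log_px` stays finite.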

3. Multiple Restarts

The algorithm runs EM several times from different initializations and keeps the run with the highest final log-likelihood.
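
The restart pattern can be sketched as follows. `fit_once` stands in for a single EM run and `toy_fit` is a placeholder that returns a made-up log-likelihood; both names, and the result dictionary layout, are assumptions for illustration only.

```python
import random

def best_of_restarts(fit_once, n_init, seed=0):
    """Run `fit_once` n_init times with different seeds; keep the best result."""
    best = None
    for i in range(n_init):
        result = fit_once(random.Random(seed + i))
        if best is None or result["log_likelihood"] > best["log_likelihood"]:
            best = result
    return best

def toy_fit(rng):
    # Placeholder for one EM run: returns a random "log-likelihood".
    return {"log_likelihood": -100.0 * rng.random(), "params": None}

best = best_of_restarts(toy_fit, n_init=5)
```

Seeding each restart separately keeps the whole procedure reproducible while still varying the starting point.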

4. Convergence Monitoring

  • Tracks log-likelihood at each iteration
  • Stops when improvement falls below tolerance
  • Provides full convergence history for analysis
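
A hedged sketch of such a stopping rule is shown below. It assumes the tolerance is compared against the absolute change in log-likelihood between consecutive iterations; the actual engine may use a different criterion. `step` is a hypothetical callable that performs one EM iteration and returns the new log-likelihood.

```python
import itertools

def run_until_converged(step, max_iter=100, tol=1e-4):
    """Call `step` until the log-likelihood change drops below `tol`.

    Returns the full log-likelihood history and a converged flag.
    """
    history = []
    converged = False
    for _ in range(max_iter):
        history.append(step())
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            converged = True
            break
    return history, converged

# Toy stand-in for EM: log-likelihood rises toward -50 with shrinking gains.
toy = (-50.0 - 10.0 * 0.5 ** i for i in itertools.count())
history, converged = run_until_converged(lambda: next(toy))
```

The returned `history` is the kind of sequence a convergence curve or per-iteration improvement chart can be drawn from.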

5. Model Selection

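  • BIC (Bayesian Information Criterion): Penalizes each free parameter by the log of the sample size, so it favors simpler models as data grows; lower is better
  • AIC (Akaike Information Criterion): Penalizes each free parameter by a constant 2, balancing fit against complexity; lower is better
  • Automatic sweep across K values, selecting the K with the best score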
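
For reference, both criteria follow from the log-likelihood and the number of free parameters. The sketch below assumes full covariance matrices when counting parameters, and the log-likelihood values in `fits` are invented purely to show the selection step; they are not results from this project.

```python
import math

def gmm_param_count(k, d):
    """Free parameters of a K-component, d-dimensional GMM (full covariances assumed)."""
    weights = k - 1                      # mixing weights sum to 1
    means = k * d
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    return weights + means + covariances

def bic(log_likelihood, n_samples, n_params):
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical log-likelihoods for K = 1, 2, 3 on 200 two-dimensional samples.
fits = {1: -520.0, 2: -430.0, 3: -425.0}
n_samples, d = 200, 2
bic_scores = {k: bic(ll, n_samples, gmm_param_count(k, d)) for k, ll in fits.items()}
best_k = min(bic_scores, key=bic_scores.get)  # lower BIC is better
```

In this toy sweep, K = 3 fits slightly better than K = 2, but the extra parameters cost more than the gain, so the minimum BIC lands on K = 2.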

Example Dataset

A sample credit card customer dataset (GMM_dataset.csv) is included with features:

  • BALANCE: Account balance
  • PURCHASES: Total purchases
  • ONEOFF_PURCHASES: Maximum single purchase
  • CASH_ADVANCE: Cash advance amount
  • CREDIT_LIMIT: Credit limit
  • PAYMENTS: Total payments
  • And more...

Project Structure

GMM/
├── main.py              # Streamlit application
├── gmm_engine.py        # Robust GMM/EM implementation
├── requirements.txt     # Python dependencies
├── GMM_dataset.csv      # Sample dataset
└── README.md           # This file

Technical Notes

PCA for Visualization

When data has more than 2 dimensions, PCA is automatically applied for scatter plots:

  • 2D view: First two principal components
  • 3D view: First three principal components
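
The projection itself can be sketched with NumPy alone, as below. The app may well rely on scikit-learn's PCA instead; `pca_project` is a hypothetical helper used only to show what "first two or three principal components" means.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the first n_components principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy data: 100 samples, 5 features
coords_2d = pca_project(X, 2)      # input for a 2D scatter plot
coords_3d = pca_project(X, 3)      # input for a 3D scatter plot
```

The 3D coordinates extend the 2D ones with one extra axis, so both views show the same leading structure.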

Memory Management

The app uses Streamlit's caching (@st.cache_data) to:

  • Store preprocessing results
  • Cache trained models
  • Accelerate repeated computations

Performance

  • Efficient numpy/scipy operations
  • Vectorized computations
  • Optimized for datasets up to 100K rows

Troubleshooting

Issue: "ModuleNotFoundError: No module named 'sklearn'"

  • Solution: Run pip install scikit-learn

Issue: Dashboard not loading

  • Solution: Ensure you're in the project directory and venv is activated

Issue: Slow convergence

  • Solution: Lower max iterations or raise the tolerance so each run stops earlier (at the cost of a less precise fit), or reduce n_init to run fewer random restarts

Issue: Singular matrix warnings

  • Solution: Increase the covariance regularization (reg_covar) setting in the sidebar

License

This project is provided as-is for educational and research purposes.

Contributing

Contributions are welcome! Areas for improvement:

  • Additional preprocessing options
  • More visualization types
  • Parallel processing for large datasets
  • Additional initialization methods
  • Online/batch EM variants
