A flexible data-driven framework for identifying clinical phenotypes using latent class and profile analysis
PhenoCluster is a Python framework for unsupervised discovery of clinical phenotypes from heterogeneous patient data. It implements an end-to-end pipeline: from data preprocessing and latent class identification to outcome association analysis, survival modelling, and multistate transition modelling.
The framework is domain-agnostic and can be applied to any clinical cohort study where the goal is to identify latent patient subgroups and characterise their relationship with clinical outcomes. Users supply a dataset and a YAML configuration file; PhenoCluster handles model selection, phenotype assignment, and downstream inference automatically.
- Latent Class / Profile Analysis via the StepMix framework with native support for mixed continuous/categorical data and missing values
- Automatic model selection using information criteria (BIC, AIC, ICL, CAIC, SABIC) with configurable cluster-size constraints
- Classification quality assessment with per-phenotype Average Posterior Probability (AvePP) and assignment confidence metrics
- Outcome association analysis with logistic regression yielding odds ratios, confidence intervals, and FDR-corrected p-values
- Survival analysis with Cox proportional hazards models producing hazard ratios and log-rank tests
- Multistate modelling with transition-specific Cox PH analysis, Monte Carlo simulation for state occupation probabilities with confidence interval bands, and clinical pathway enumeration
- Temporal and multi-site generalizability (v0.3.0) - validate phenotypes across time windows or sites/centers (cutoff, sliding/expanding windows, leave-one-site-out), with apply-only or refit-and-match modes, calibration metrics (Brier, ECE), drift detection (PSI, KS, chi-square), and per-phenotype OR/HR concordance with FDR-corrected delta tests
- Optional Streamlit dashboard (v0.3.0) for interactive exploration of saved results: `phenocluster dashboard <results_dir>`
- Comprehensive output including an interactive HTML report (toggleable via `generate_html_report` or `--no-html-report`), forest plots with confidence intervals, Kaplan-Meier and Nelson-Aalen curves, heatmaps, and JSON/CSV data exports
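
As an illustration of the model-selection feature listed above, the sketch below computes several of the named information criteria from a fitted model's log-likelihood and picks the number of classes subject to a minimum cluster size. It is not PhenoCluster's implementation: the function names, the `min_cluster_fraction` parameter, and the candidate fit summaries are hypothetical, and ICL is left out because it also needs the posterior entropy.

```python
import math


def information_criteria(log_lik: float, n_params: int, n_samples: int) -> dict:
    """Penalised-likelihood criteria; lower values indicate a better trade-off."""
    deviance = -2.0 * log_lik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_samples),
        "CAIC": deviance + n_params * (math.log(n_samples) + 1),
        "SABIC": deviance + n_params * math.log((n_samples + 2) / 24),
    }


def select_k(candidates, n_samples, criterion="BIC", min_cluster_fraction=0.05):
    """Return the K with the lowest criterion among fits whose smallest class is large enough."""
    eligible = [
        c for c in candidates
        if min(c["class_sizes"]) / n_samples >= min_cluster_fraction
    ]
    if not eligible:
        raise ValueError("no candidate satisfies the cluster-size constraint")

    def score(c):
        return information_criteria(c["log_lik"], c["n_params"], n_samples)[criterion]

    return min(eligible, key=score)["k"]


# Hypothetical fit summaries for K = 2, 3, 4 on 1000 patients.
candidates = [
    {"k": 2, "log_lik": -5200.0, "n_params": 13, "class_sizes": [600, 400]},
    {"k": 3, "log_lik": -5100.0, "n_params": 20, "class_sizes": [500, 300, 200]},
    {"k": 4, "log_lik": -5090.0, "n_params": 27, "class_sizes": [500, 290, 190, 20]},
]
best_k = select_k(candidates, n_samples=1000)
print(best_k)
```

In this toy input the four-class solution is excluded because its smallest class holds only 2% of patients, and the three-class solution has the lowest BIC among the remaining candidates.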
Requires Python >= 3.11.

    pip install phenocluster

To enable the optional interactive dashboard:

    pip install 'phenocluster[dashboard]'

Generate a configuration file from a profile template:

    phenocluster create-config -p complete -o config.yaml

Open `config.yaml` and fill in your dataset-specific parameters:

    global:
      project_name: "My Study"
      output_dir: "results"
      random_state: 42

    data:
      continuous_columns:
        - age
        - bmi
        - lab_value_1
      categorical_columns:
        - sex
        - smoking_status
        - disease_stage
      split:
        test_size: 0.2

    outcome:
      enabled: true
      outcome_columns:
        - mortality_30d
        - readmission_30d

    survival:
      enabled: true
      targets:
        - name: "overall_survival"
          time_column: "time_to_death"
          event_column: "death_indicator"

Run the pipeline:

    phenocluster run -d data.csv -c config.yaml

Results are written to the output directory (default: `results/`):

| File | Description |
|---|---|
| `analysis_report.html` | Comprehensive HTML report (skip with `generate_html_report: false` or `--no-html-report`) |
| `cluster_statistics.json` | Phenotype sizes, feature distributions, and classification quality |
| `outcome_results.json` | Odds ratios with confidence intervals and p-values |
| `survival_results.json` | Kaplan-Meier estimates and Cox PH hazard ratios |
| `multistate_results.json` | Transition-specific hazard ratios, pathways, and state occupation |
| `data/model_fit_metrics.csv` | Information criteria, entropy, and average posterior probabilities |
| `data/phenotypes_data.csv` | Original data augmented with phenotype assignments |
| `data/posterior_probabilities.csv` | Posterior class membership probabilities |
| `results/model_selection_summary.json` | Model selection comparison table and best model info |
| `results/feature_importance.json` | Feature characterisation per phenotype |
| `results/validation_report.json` | Internal validation metrics (train/test comparison) |
| `results/stability_results.json` | Consensus clustering stability metrics |
| `results/split_info.json` | Train/test split details |
| `results/external_validation_results.json` | External validation results (when enabled) |
| `results/temporal_validation_results.json` | Temporal generalizability results (when enabled, v0.3.0) |
| `results/multisite_validation_results.json` | Multi-site (LOGO / holdout) generalizability results (v0.3.0) |
| `results/external_cohorts_results.json` | External-CSV generalizability results (v0.3.0) |
| `results/generalizability_summary.json` | Aggregate ARI / PSI summary across cohorts plus `training_scope` flag (v0.3.0) |
| `data/generalizability/` | Per-cohort `cluster_distribution_<label>.csv` and `drift_<label>.csv` (v0.3.0) |
| `phenocluster.log` | Pipeline execution log |
| `artifacts/` | Cached intermediate results for incremental re-runs |

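
The classification-quality outputs listed above report an Average Posterior Probability (AvePP) per phenotype. The sketch below shows one common way to compute it from a matrix of posterior class probabilities, assuming modal (highest-probability) assignment. The function name and the toy matrix are hypothetical, and the column layout of `data/posterior_probabilities.csv` is not assumed here, so the rows are given as plain lists.

```python
def average_posterior_probability(posteriors):
    """Return {class_index: mean posterior of the assigned class}.

    Each patient is assigned to the class with the highest posterior;
    AvePP for a class averages that posterior over its assigned patients.
    """
    sums, counts = {}, {}
    for row in posteriors:
        assigned = max(range(len(row)), key=row.__getitem__)
        sums[assigned] = sums.get(assigned, 0.0) + row[assigned]
        counts[assigned] = counts.get(assigned, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Toy posterior matrix: 4 patients, 2 phenotypes (rows sum to 1).
posteriors = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.05, 0.95],
]
avepp = average_posterior_probability(posteriors)
print(avepp)  # roughly 0.85 for class 0 and 0.825 for class 1
```

A common rule of thumb in the mixture-modelling literature treats AvePP values above about 0.7 as indicating adequate class separation; whether PhenoCluster applies such a threshold is not stated here.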
Add a generalizability block to the config to enable temporal, multi-site, and/or external-CSV validation. The default training_scope: per_split fits a fresh preprocessor and StepMix model on the derivation rows of each in-CSV split and applies it to the validation rows. The pipeline's full-cohort model stays untouched for descriptive analyses.

    generalizability:
      enabled: true
      training_scope: per_split          # per_split (default) | global
      feature_selector_scope: auto       # auto (default) | global | per_split
      refit: true                        # refit-and-match Hungarian alignment
      min_validation_size_for_refit: 100
      temporal:
        time_column: admission_date
        scheme: cutoff                   # cutoff | fraction | sliding | expanding
        time_cutoff: "2020-12-31"
      multisite:
        site_column: center
        scheme: logo                     # logo | holdout | pairwise
        min_site_size: 30
      external_cohorts:                  # optional, one or more separate CSVs
        - { path: ./cohort_B.csv, label: hospital_X, kind: site }
        - { path: ./cohort_2024.csv, label: era_2024, kind: temporal }
      drift: { enabled: true, n_bins: 10, top_k: 20 }
      calibration: { enabled: true, n_bins: 10, strategy: quantile }
      outcome_concordance: { enabled: true, fdr_method: bh, alpha: 0.05 }

Each cohort yields a phenotype distribution, drift table, refit-and-match metrics (ARI / NMI / Hungarian-matched accuracy), calibration metrics, and per-phenotype OR/HR concordance with FDR-corrected delta tests. Cohort reports also expose a `fit_mode` field (`per_split` for in-CSV splits under the default scope; `global` for external CSVs and the legacy permissive path) and a `derivation_only_ari` field showing how the fresh derivation-only fit compares to the global model.
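
To make the drift metrics more concrete, here is a minimal sketch of the Population Stability Index (PSI) between a derivation and a validation cohort for one feature that has already been binned. It uses the textbook formula and is not taken from PhenoCluster's source; the function name, the `eps` floor, and the bin counts are hypothetical.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), on bin proportions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both cohorts must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_prop = max(e / e_total, eps)  # floor avoids log(0) for empty bins
        a_prop = max(a / a_total, eps)
        psi += (a_prop - e_prop) * math.log(a_prop / e_prop)
    return psi


# Toy bin counts for one feature in the derivation and validation cohorts.
derivation = [100, 200, 400, 200, 100]
validation = [150, 250, 300, 200, 100]

psi_same = population_stability_index(derivation, derivation)
psi_shift = population_stability_index(derivation, validation)
print(psi_same, round(psi_shift, 3))
```

Identical distributions give a PSI of zero, and the value grows as the validation cohort's bin proportions move away from the derivation cohort's. Bin edges are usually fixed on the derivation cohort, which is presumably what the `n_bins` setting in the `drift` block controls.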

    pip install 'phenocluster[dashboard]'
    phenocluster dashboard ./results/

Streamlit launches at http://127.0.0.1:8501 with tabs for Overview, Phenotypes, Outcomes, Survival, Multistate, and Generalizability, plus a per-cohort Drift explorer.

PhenoCluster executes the following stages in order:
- Data quality assessment. Missingness patterns, correlations, variance, and MCAR testing.
- Train/test split. Stratified splitting with configurable test size, performed before preprocessing to prevent data leakage.
- Preprocessing. Imputation, outlier handling, categorical encoding, standardization, and feature selection -- fit on training data only, then applied to the test set.
- Model selection. Cross-validated information criterion search over cluster counts (training set only).
- Full-cohort refit. Once K is selected, the preprocessing pipeline and the LCA/LPA model are refitted on the entire cohort, and phenotypes are reordered by size (largest = Phenotype 0).
- Stability analysis. Consensus clustering over subsampled runs.
- Internal validation. Train/test log-likelihood comparison, cluster proportion stability, and outcome OR consistency.
- Outcome association. Logistic regression for binary outcomes with FDR-corrected p-values (optional).
- Survival analysis. Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios (optional).
- Multistate modelling. Transition-specific Cox PH models, transition hazard ratios, and Monte Carlo simulation (optional).
- Temporal / multi-site generalizability. Re-evaluate the derivation phenotypes on later time windows, held-out sites, and external CSVs; report ARI / NMI / matched accuracy, calibration, drift, and OR/HR concordance (optional, v0.3.0).
- Report generation. Interactive HTML report with all figures and tables.
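
Several of the stages above report FDR-corrected p-values. The sketch below shows the Benjamini-Hochberg step-up adjustment, assuming that is the procedure in use (the configuration names `fdr_method: bh` for the concordance tests; the method used in the outcome stage is not stated here). The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four phenotype-outcome comparisons.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values])
```

With these inputs the adjusted values are approximately 0.04, 0.0533, 0.0533, and 0.20, so only the first comparison would remain significant at an alpha of 0.05.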

| Command | Description |
|---|---|
| `phenocluster run -d DATA -c CONFIG [--force-rerun] [-v] [-q] [--html-report/--no-html-report]` | Run the full pipeline |
| `phenocluster create-config [-p PROFILE] [-o OUTPUT]` | Generate a config YAML from a profile template |
| `phenocluster validate-config -c CONFIG [-d DATA]` | Validate config structure; cross-check columns against data |
| `phenocluster list-profiles` | List available configuration profile templates |
| `phenocluster show-profile NAME` | Print the resolved YAML for a profile with syntax highlighting |
| `phenocluster dashboard RESULTS_DIR [--port 8501] [--host 127.0.0.1] [--headless/--browser]` | Launch the optional Streamlit dashboard (requires `pip install 'phenocluster[dashboard]'`) |
| `phenocluster version` | Show version, repository link, and documentation link |

Profiles set sensible defaults for common use-cases. Generate one with phenocluster create-config -p <profile>:

| Profile | Description | Inference | Stability | Multistate |
|---|---|---|---|---|
| `descriptive` | Phenotype discovery only, no statistical inference | off | on | off |
| `complete` | All analyses enabled (outcomes, survival, multistate) | on | on | on |
| `quick` | Fast iteration for development | on | off | off |

See the full Configuration Reference in the documentation.
Full documentation (statistical methods, configuration reference, output descriptions) is available at ettorerocchi.github.io/phenocluster.
This project is licensed under the MIT License.
If you use PhenoCluster in your research, please cite:

Available soon.

This project relies on StepMix, a Python package for pseudo-likelihood estimation of generalized mixture models with external variables. We thank the authors for making their work openly available.

If you use this framework, please also cite:

Morin, S., Legault, R., Laliberté, F., Bakk, Z., Giguère, C.-É., de la Sablonnière, R., & Lacourse, É. (2025). StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables. Journal of Statistical Software, 113(8), 1-39. doi: 10.18637/jss.v113.i08
