A Toolbox for Generalized Connectome-based Predictive Modeling
Understanding brain–behavior relationships and predicting cognitive and clinical outcomes from neuromarkers are central tasks in neuroscience. Connectome-based Predictive Modeling (CPM) has been widely adopted to predict behavioral traits from brain connectivity data; however, existing implementations are largely restricted to continuous outcomes, often overlook essential non-imaging covariates, and are difficult to apply in clinical or disease cohort settings. To address these limitations, we present GenCPM, a generalized CPM framework implemented in open-source R software. GenCPM extends traditional CPM by supporting binary, categorical, and time-to-event outcomes and allows the integration of covariates such as demographic and genetic information, thereby improving predictive accuracy and interpretability. To handle high-dimensional data, GenCPM incorporates marginal screening and regularized regression techniques, including LASSO, ridge, and elastic net, for efficient selection of informative brain connections. We demonstrate the utility of GenCPM through analyses of the Anti-Amyloid Treatment in Asymptomatic Alzheimer’s Disease (A4) Study and the Alzheimer’s Disease Neuroimaging Initiative (ADNI), showing enhanced predictive performance and improved signal attribution compared to standard methods. GenCPM offers a flexible, scalable, and interpretable solution for predictive modeling in brain connectivity research, supporting broader applications in cognitive and clinical neuroscience.
Fig. 1 Each subject provides a connectivity matrix and an outcome variable (e.g., a behavioral or clinical measure). (a) In the original GenCPM framework, marginal screening is applied to select the top $K$ significant edges based on a predefined threshold $p$. The selected edges are separated into positively and negatively correlated sets, and summary measures, computed as sums of the average connectivity strength within each set, are derived for each subject. These connectivity-derived predictors are then combined with optional non-imaging covariates and entered into downstream models, including linear, logistic, multinomial, and CoxPH regression. (b) In the penalized GenCPM variant, the full set of selected edge features is retained without aggregation, allowing the model to capture fine-grained individual edge-level contributions. The resulting feature matrix, combined with non-imaging covariates, is then input into regularized regression models such as LASSO, ridge, and elastic net, where penalization is applied only to edge features, for joint modeling and feature selection while preserving the contribution of non-imaging covariates.
You can install GenCPM from GitHub with:
library(devtools)
install_github("BXU69/GenCPM")
library(GenCPM)
The train.GenCPM function is an intermediate helper that trains the models used inside linear.GenCPM, logit.GenCPM, and multinom.GenCPM. It does not output predictions for test data, so it is not used directly for prediction.
The linear.GenCPM, logit.GenCPM, multinom.GenCPM, and cox.GenCPM functions are the four main functions for fitting models and producing predictions for continuous, binary, categorical, and survival responses, respectively, using Connectome-based Predictive Modeling.
linear.GenCPM(
connectome, behavior, x=NULL,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k = dim(connectome)[3], correlation = "pearson",
thresh = .01, edge = "separate", seed = 1220
)
logit.GenCPM(
connectome, behavior, x=NULL,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k = dim(connectome)[3], correlation = "pearson",
thresh = .01, edge = "separate", seed = 1220
)
multinom.GenCPM(
connectome, behavior, x=NULL,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k = dim(connectome)[3], correlation = "pearson",
thresh = .01, edge = "separate", seed = 1220
)
connectome: an array holding the connectivity between M nodes for N subjects. The dimension should be M*M*N.
behavior: a vector containing the behavior measure for all subjects.
x: a data frame containing the non-image variables in the model.
external.connectome: an external array indicating the connectivity for prediction.
external.x: an external data frame containing the non-image variables for prediction.
cv: a character indicating the method of cross-validation. The default method is "leave-one-out".
k: a parameter used to set the number of folds for k-fold cross-validation.
correlation: the method for finding the correlation between edge and behavior. The default is "pearson". Alternative approaches are "spearman" and "kendall".
thresh: the value of the threshold for selecting significantly related edges. The default value is .01.
edge: a character indicating whether the model is fitted with positive and negative edges separately or with all selected edges combined. The default is "separate".
seed: the value used to set the seed for random sampling during cross-validation. The default value is 1220.
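The cv, k, and seed arguments together determine how subjects are split into folds. The following is a minimal base-R sketch of one way such a split could be built. The helper name make_folds is hypothetical and is not part of GenCPM, whose internal splitting may differ.

```r
# Hypothetical helper (not part of GenCPM): assign N subjects to k folds.
make_folds <- function(N, k = N, seed = 1220) {
  stopifnot(k >= 2, k <= N)
  set.seed(seed)                               # reproducible shuffling
  fold_id <- rep(seq_len(k), length.out = N)   # balanced fold labels
  sample(fold_id)                              # randomly assign labels to subjects
}

folds_loo <- make_folds(N = 10)          # k = N: every subject is its own fold
folds_5   <- make_folds(N = 10, k = 5)   # 5-fold: two subjects per fold
table(folds_5)
```

With k equal to the number of subjects, which is the default k = dim(connectome)[3], every fold holds exactly one subject, so k-fold and leave-one-out coincide.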
The functions will return a list containing the following output:
positive_edges: all selected edges having a significantly positive relationship with behavior response.
negative_edges: all selected edges having a significantly negative relationship with behavior response.
r_mat: a list of matrices consisting of the Pearson correlation coefficients between edges and behavior.
p_mat: a list of matrices consisting of the p-values from the Pearson correlation between edges and behavior.
positive_model: fitted model using positive edges. Only applicable when users input external.connectome.
negative_model: fitted model using negative edges. Only applicable when users input external.connectome.
combined_model: fitted model using both positive and negative edges. Only applicable when users input external.connectome.
positive_predicted_behavior: predicted behaviors from the model fitted using positive edges separately. Not applicable when the argument edge = "combined".
negative_predicted_behavior: predicted behaviors from the model fitted using negative edges separately. Not applicable when the argument edge = "combined".
predicted_behavior: predicted behaviors from the model fitted using all edges. Not applicable when the argument edge = "separate".
actual_behavior: actual values of behavior response.
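To make outputs such as positive_edges, negative_edges, r_mat, and p_mat concrete, here is a small base-R sketch of marginal screening followed by summary scores on simulated data. It is an illustration under stated assumptions, not the package's internal code: all variable names are illustrative, no cross-validation is performed, and the per-subject summary is taken as a plain sum over the selected edges.

```r
# Sketch of marginal screening on simulated data (all names are illustrative).
set.seed(1)
N <- 60; M <- 6                               # 60 subjects, 6 nodes
connectome <- array(rnorm(M * M * N), dim = c(M, M, N))
for (i in seq_len(N)) {                       # symmetrize each subject's matrix
  connectome[, , i] <- (connectome[, , i] + t(connectome[, , i])) / 2
}
# Edge (1, 2) raises the response and edge (3, 4) lowers it.
behavior <- 2 * connectome[1, 2, ] - 2 * connectome[3, 4, ] + rnorm(N, sd = 0.5)

thresh <- 0.01
r_mat <- matrix(0, M, M); p_mat <- matrix(1, M, M)
for (a in 1:(M - 1)) for (b in (a + 1):M) {   # upper triangle only
  ct <- cor.test(connectome[a, b, ], behavior, method = "pearson")
  r_mat[a, b] <- ct$estimate
  p_mat[a, b] <- ct$p.value
}
positive_edges <- which(p_mat < thresh & r_mat > 0, arr.ind = TRUE)
negative_edges <- which(p_mat < thresh & r_mat < 0, arr.ind = TRUE)

# Per-subject summary score: sum of connectivity over the selected edges.
edge_sum <- function(edges) {
  vapply(seq_len(N), function(i) sum(connectome[, , i][edges]), numeric(1))
}
pos_score <- edge_sum(positive_edges)
neg_score <- edge_sum(negative_edges)
fit <- lm(behavior ~ pos_score + neg_score)   # analogue of edge = "separate" scores
```

In a cross-validated setting the screening step would be repeated inside each training fold, which is why r_mat and p_mat are described above as lists of matrices.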
cox.GenCPM(
connectome, x=NULL, time, status,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k = dim(connectome)[3],
thresh = .01, edge="separate", seed = 1220
)
connectome: an array holding the connectivity between M nodes for N subjects. The dimension should be M*M*N.
x: a data frame containing the non-image variables in the model.
time: the follow-up time for all individuals.
status: the status indicator, normally 0=alive and 1=event.
external.connectome: an external array indicating the connectivity for prediction.
external.x: an external data frame containing the non-image variables for prediction.
cv: a character indicating the method of cross-validation. The default method is "leave-one-out".
k: a parameter used to set the number of folds for k-fold cross-validation.
thresh: the value of the threshold for selecting significantly related edges. The default value is .01.
edge: a character indicating whether the model is fitted with positive and negative edges separately or with all selected edges combined. The default is "separate".
seed: the value used to set the seed for random sampling during cross-validation. The default value is 1220.
positive_edges: all selected edges having a significantly positive relationship with survival outcome in a marginal test.
negative_edges: all selected edges having a significantly negative relationship with survival outcome in a marginal test.
positive_model: fitted model using positive edges. Only applicable when users input external.connectome.
negative_model: fitted model using negative edges. Only applicable when users input external.connectome.
combined_model: fitted model using both positive and negative edges. Only applicable when users input external.connectome.
positive_predicted_linear_predictor: predicted linear predictors from the Cox model fitted using positive edges separately. Not applicable when the argument edge = "combined".
negative_predicted_linear_predictor: predicted linear predictors from the Cox model fitted using negative edges separately. Not applicable when the argument edge = "combined".
predicted_linear_predictor: predicted linear predictors from the Cox model fitted using all edges. Not applicable when the argument edge = "separate".
actual_time: actual values of survival time.
actual_status: actual values of status indicator.
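For readers unfamiliar with the Cox linear predictor reported above, the sketch below fits a Cox model to a simulated summary score plus one covariate and extracts the linear predictor and the C-index. It assumes the survival package is installed (it ships with standard R distributions). The variables pos_score and age are hypothetical stand-ins, and the fit is in-sample rather than cross-validated, so it does not reproduce what cox.GenCPM does internally.

```r
library(survival)  # assumed available; provides Surv(), coxph(), concordance()

set.seed(2)
N <- 200
pos_score <- rnorm(N)                 # stand-in for a positive-edge summary score
age <- rnorm(N, mean = 70, sd = 5)    # hypothetical non-imaging covariate
hazard <- 0.1 * exp(0.8 * pos_score + 0.03 * (age - 70))
event_time  <- rexp(N, rate = hazard) # latent event times
censor_time <- rexp(N, rate = 0.05)   # independent censoring times
time   <- pmin(event_time, censor_time)
status <- as.integer(event_time <= censor_time)   # 1 = event, 0 = censored

fit <- coxph(Surv(time, status) ~ pos_score + age)
lp <- predict(fit, type = "lp")       # linear predictor, one value per subject
cindex <- concordance(fit)$concordance
```

A larger linear predictor corresponds to a higher estimated hazard, and the C-index summarizes how well the ordering of linear predictors agrees with the ordering of observed event times.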
The linear.regularized.GenCPM, logit.regularized.GenCPM, multinom.regularized.GenCPM, and cox.regularized.GenCPM functions are penalized versions of the four .GenCPM functions above, adding LASSO, ridge, or elastic-net regularization.
linear.regularized.GenCPM(
connectome, behavior, x,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k=dim(connectome)[3], correlation = "pearson",
thresh=.01, edge="separate", type="lasso",
lambda=NULL, alpha=NULL, seed=1220
)
logit.regularized.GenCPM(
connectome, behavior, x,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k=dim(connectome)[3], correlation = "pearson",
thresh=.01, edge="separate", type="lasso",
lambda=NULL, alpha=NULL, seed=1220
)
multinom.regularized.GenCPM(
connectome, behavior, x,
external.connectome = NULL, external.x = NULL,
cv="leave-one-out", k = dim(connectome)[3], correlation = "pearson",
thresh = .01, edge = "separate", type="lasso",
lambda=NULL, alpha=NULL, seed = 1220
)
connectome: an array holding the connectivity between M nodes for N subjects. The dimension should be M*M*N.
behavior: a vector containing the behavior measure for all subjects.
x: a data frame containing the non-image variables in the model.
external.connectome: an external array indicating the connectivity for prediction.
external.x: an external data frame containing the non-image variables for prediction.
cv: a character indicating the method of cross-validation. The default method is "leave-one-out".
k: a parameter used to set the number of folds for k-fold cross-validation.
correlation: the method for finding the correlation between edge and behavior. The default is "pearson". Alternative approaches are "spearman" and "kendall".
thresh: the value of the threshold for selecting significantly related edges. The default value is .01.
edge: a character indicating whether the model is fitted with positive and negative edges separately or with all selected edges combined. The default is "separate".
type: type of penalty. The default is "lasso".
lambda: the value of penalty.
alpha: the alpha for elastic net penalty.
seed: the value used to set the seed for random sampling during cross-validation. The default value is 1220.
positive_edges: all selected edges having a significantly positive relationship with behavior response.
negative_edges: all selected edges having a significantly negative relationship with behavior response.
positive_model: fitted model using positive edges. Only applicable when users input external.connectome.
negative_model: fitted model using negative edges. Only applicable when users input external.connectome.
combined_model: fitted model using both positive and negative edges. Only applicable when users input external.connectome.
positive_predicted_behavior: predicted behaviors from the model fitted using positive edges separately. Not applicable when the argument edge = "combined".
negative_predicted_behavior: predicted behaviors from the model fitted using negative edges separately. Not applicable when the argument edge = "combined".
predicted_behavior: predicted behaviors from the model fitted using all edges. Not applicable when the argument edge = "separate".
actual_behavior: actual values of behavior response.
positive_lambda_total: the final lambda indicating penalty used in the model fitted with positive edges separately for each fold during cross-validation. Not applicable when edge = "combined".
negative_lambda_total: the final lambda indicating penalty used in the model fitted with negative edges separately for each fold during cross-validation. Not applicable when edge = "combined".
lambda_total: the final lambda indicating penalty used in the model fitted with all edges for each fold during cross-validation. Not applicable when edge = "separate".
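The idea of penalizing only the edge features while leaving non-imaging covariates unpenalized can be illustrated with ridge regression in closed form. This is a base-R sketch under stated assumptions, not GenCPM's implementation: the helper ridge_edges_only is hypothetical, lambda is fixed by hand rather than tuned, and only the ridge penalty is shown. With packages such as glmnet, the same idea is usually expressed through the penalty.factor argument.

```r
# Hypothetical helper (not part of GenCPM): ridge regression in closed form,
# with the penalty applied to the edge columns only.
ridge_edges_only <- function(edges, covars, y, lambda) {
  X <- cbind(1, covars, edges)                    # intercept, covariates, edges
  pen <- c(0, rep(0, ncol(covars)), rep(1, ncol(edges)))  # 1 = penalized column
  beta <- solve(crossprod(X) + lambda * diag(pen), crossprod(X, y))
  drop(beta)
}

set.seed(3)
N <- 100
edges  <- matrix(rnorm(N * 5), N, 5)      # 5 selected edge features
covars <- matrix(rnorm(N * 2), N, 2)      # 2 non-imaging covariates
y <- 1.5 * covars[, 1] + edges[, 1] + rnorm(N)

b_small <- ridge_edges_only(edges, covars, y, lambda = 0.1)
b_large <- ridge_edges_only(edges, covars, y, lambda = 1e4)
round(rbind(b_small, b_large), 3)   # columns: intercept, 2 covariates, 5 edges
```

As lambda grows, the five edge coefficients shrink toward zero while the covariate coefficients remain free, which is the behavior described for the penalized variant in Fig. 1(b).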
cox.regularized.GenCPM(
connectome, x=NULL, time, status,
cv="leave-one-out", k = dim(connectome)[3], thresh = .01,
edge="separate", type="lasso", lambda=NULL, alpha=NULL, seed = 1220
)
connectome: an array holding the connectivity between M nodes for N subjects. The dimension should be M*M*N.
x: a data frame containing the non-image variables in the model.
time: the follow-up time for all individuals.
status: the status indicator, normally 0=alive and 1=event.
external.connectome: an external array indicating the connectivity for prediction.
external.x: an external data frame containing the non-image variables for prediction.
cv: a character indicating the method of cross-validation. The default method is "leave-one-out".
k: a parameter used to set the number of folds for k-fold cross-validation.
correlation: the method for finding the correlation between edge and behavior. The default is "pearson". Alternative approaches are "spearman" and "kendall".
thresh: the value of the threshold for selecting significantly related edges. The default value is .01.
edge: a character indicating whether the model is fitted with positive and negative edges separately or with all selected edges combined. The default is "separate".
type: type of penalty. The default is "lasso".
lambda: the value of penalty.
alpha: the alpha for elastic net penalty.
seed: the value used to set the seed for random sampling during cross-validation. The default value is 1220.
positive_edges: all selected edges having a significantly positive relationship with survival outcome in a marginal test.
negative_edges: all selected edges having a significantly negative relationship with survival outcome in a marginal test.
positive_model: fitted model using positive edges. Only applicable when users input external.connectome.
negative_model: fitted model using negative edges. Only applicable when users input external.connectome.
combined_model: fitted model using both positive and negative edges. Only applicable when users input external.connectome.
positive_predicted_linear_predictor: predicted linear predictor from the model fitted using positive edges separately. Not applicable when the argument edge = "combined".
negative_predicted_linear_predictor: predicted linear predictor from the model fitted using negative edges separately. Not applicable when the argument edge = "combined".
predicted_linear_predictor: predicted linear predictor from the model fitted using all edges. Not applicable when the argument edge = "separate".
actual_status: actual values of status indicator.
actual_time: actual values of survival time.
positive_lambda_total: the final lambda indicating penalty used in the model fitted with positive edges separately for each fold during cross-validation. Not applicable when edge = "combined".
negative_lambda_total: the final lambda indicating penalty used in the model fitted with negative edges separately for each fold during cross-validation. Not applicable when edge = "combined".
lambda_total: the final lambda indicating penalty used in the model fitted with all edges for each fold during cross-validation. Not applicable when edge = "separate".
The assess.GenCPM function assesses model performance across the testing folds, using metrics chosen according to the model type.
assess.GenCPM(
object, model="linear", edge="separate"
)
object: the GenCPM object returned by a .GenCPM or .regularized.GenCPM function.
model: a character string naming one of the built-in regression models: "linear" for linear.GenCPM and linear.regularized.GenCPM; "logistic" for logit.GenCPM and logit.regularized.GenCPM; "multinom" for multinom.GenCPM and multinom.regularized.GenCPM; and "cox" for cox.GenCPM and cox.regularized.GenCPM. The default is "linear".
edge: how edges were used to fit the model; it must match the edge setting used to create object. Use "separate" for two separate models fitted with positive edges and negative edges respectively, and "combined" for a single model using all selected edges. The default is "separate". The function reports an error if edge is not correctly specified.
The output of assess.GenCPM is a list containing the metrics that assess model performance (MSE, AUC, multi-class AUC, or C-index, depending on the model type), the predicted response, and the actual response.
The heatmap.GenCPM function visualizes the selected edges, obtained from either thresholding or regularization, based on the 10-node network label in the Shen268 atlas.
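As a reminder of what two of these metrics measure, the following base-R sketch computes the MSE for a continuous response and the AUC for a binary response. The helper names mse and auc are hypothetical and are not GenCPM functions, and the AUC is computed with the rank-sum (Mann-Whitney) identity rather than whatever routine the package uses.

```r
# Hypothetical helpers (not GenCPM functions) for two of the metrics named above.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# AUC via the rank-sum (Mann-Whitney) identity; `actual` is coded 0/1.
auc <- function(actual, score) {
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  r <- rank(score)                       # average ranks handle tied scores
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

mse(c(1, 2, 3), c(1, 2, 5))                    # 4/3
auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))     # 0.75
```

An AUC of 0.75 here means that in three of the four case-control pairs the case received the higher score.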
heatmap.GenCPM(
cpm, foldThreshold = .5
)
cpm: the GenCPM object returned by a .GenCPM or .regularized.GenCPM function.
foldThreshold: the edges selected for over this many folds will be plotted. If set to .5, the edges selected at least half of the time are plotted.
The output of heatmap.GenCPM is a heatmap demonstrating the strength of the correlation between connectivity and response by the shade of the color, with red representing a positive correlation and blue representing a negative correlation.
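The role of foldThreshold can be sketched in base R as a filter on how often each edge is selected across folds. This is a hypothetical illustration, not the package's plotting code: the selected array is simulated, and the comparison uses >= on the assumption that "at least" is the intended reading.

```r
# Hypothetical illustration (not GenCPM internals): keep edges that were
# selected in at least a fraction `foldThreshold` of the folds.
set.seed(4)
M <- 5; n_folds <- 10
# selected[a, b, f] is TRUE when edge (a, b) was selected in fold f
selected <- array(runif(M * M * n_folds) < 0.5, dim = c(M, M, n_folds))
selected[1, 2, ] <- TRUE                      # an edge selected in every fold
selected[3, 4, ] <- FALSE                     # an edge never selected

foldThreshold <- 0.8
selection_rate <- apply(selected, c(1, 2), mean)   # fraction of folds per edge
stable_edges <- selection_rate >= foldThreshold    # assumption: inclusive cutoff
which(stable_edges, arr.ind = TRUE)
```

Raising foldThreshold keeps only edges that are selected consistently across folds, which gives a sparser plot.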
We generate simulated data to illustrate how to use some functions of this package. The following example is a tutorial for linear.GenCPM; the other model-fitting functions are used in the same way.
First, we generate the connectome, a 268*268*500 array, and the behavior response y as follows. 268 is the preferred dimension because the heatmap of selected edges uses the 10-node network label from the Shen268 atlas.
set.seed(123)
N <- 500 # 500 individuals
M <- 268 # 268 nodes
connectome <- array(0, dim = c(M, M, N)) # initialize the 3D array to store the connectivity matrix
edge <- matrix(NA, nrow = N, ncol = (M+1)*M/2) # to store the upper-triangle part of the matrix
index <- c(1:((M+1)*M/2))
pos_ind <- sample(index, floor(((M+1)*M/2)/3), replace = FALSE) # randomly sample 1/3 of the edges to be positively associated with the response
neg_ind <- sample(index[-pos_ind], floor(((M+1)*M/2)/3), replace = FALSE) # randomly sample another 1/3 of the edges to be negatively associated with the response
for (i in 1:N) {
mat <- matrix(runif(M*M, min = -1, max = 1), nrow = M) # generate random connectivity matrix
sym_mat <- (mat + t(mat)) / 2 # make the matrix symmetric
diag(sym_mat) <- 1 # set the diagonal to 1
connectome[, , i] <- sym_mat
edge[i,] <- sym_mat[upper.tri(sym_mat, diag = TRUE)]
}
corr <- rep(0, (M+1)*M/2) # coefficient 0 for edges that were not selected
corr[pos_ind] <- 0.8 # coefficient 0.8 for edges selected to be positive
corr[neg_ind] <- -0.8 # coefficient -0.8 for edges selected to be negative
epsilon <- rnorm(N) # generate error term
y <- edge %*% corr + epsilon # generate response y
Then, the simulated data are passed to linear.GenCPM to fit a linear regression model and make predictions. We do not include the non-image covariates x in this example and keep the other settings at their defaults.
lm.fit <- linear.GenCPM(connectome, y)
Next, you may want to assess the predictions with assess.GenCPM.
assess.GenCPM(lm.fit, model = "linear", edge = "separate")
Make sure that edge matches the setting used when the model was fitted with linear.GenCPM; otherwise the function reports an error. Also remember to change the model argument when assessing a logistic, multinomial logistic, or Cox model.
Finally, we can visualize the significant edges identified by GenCPM in a heatmap.
heatmap.GenCPM(lm.fit, foldThreshold = 0.8)
foldThreshold = 0.8 means that the edges selected in over 80% of the folds are plotted. You can tune this parameter as needed.
Please cite the paper when you use the GenCPM package:

