DSQRM: Distribution-on-scalar Single-index Quantile Regression Model for Handling Tumor Heterogeneity
This paper develops a distribution-on-scalar single-index quantile regression modeling framework to investigate the relationship between cancer imaging responses and scalar covariates of interest while tackling tumor heterogeneity. Conventional association analysis methods typically assume that the imaging responses are well aligned after some preprocessing steps. However, this assumption is often violated in practice due to imaging heterogeneity. Although some distribution-based approaches have been developed to deal with this heterogeneity, major challenges are posed by the nonlinear subspace formed by the distributional responses, the unknown nonlinear association structure, and the lack of statistical inference. Our method addresses all of these challenges. We establish both estimation and inference procedures for the unknown functions in our model. The asymptotic properties of both estimation and inference procedures are systematically investigated. The finite-sample performance of our proposed method is assessed using both Monte Carlo simulations and a real data example on brain cancer images from the TCIA-GBM collection.

Fig 2. The interface of the software.
- ./utilities/: contains all the user-defined functions
- ./simu_results/:
  - simu_n%d_N%d_m%d_p%d_nsimu%d.mat: the simulated dataset with density estimators & LQD representations;
  - simu_estimation_error.csv: the mean & std of ISE of estimated functions by our method and the competitors, SIVC (2016) & PWSI
- ./Software_DSQRM/
  - ./DSQRM_Installer_web.exe: the installation file for the software "DSQRM"
    - requires MATLAB Runtime, which is downloaded automatically during the installation
  - ./DSQRM_V1.mlapp: the source code for the software.
  - ./for_redistribution_files_only/: contains the executable file on Windows
  - ./README_DSQRM_software.md: the specific usage of the software
  - ./software_guide.pdf: an example of step-by-step usage.
  - ./examples/: contains the dataset and results for the GBM study
figure3.m: estimators & 95% SCB for a simulated dataset
- Settings: n = 200; p = 2; N = 1000; m = 100; $\tau$ = 0.5
- Expected running time: ~ 30 mins on Intel(R) Core(TM) i7-8700 CPU

figure4.m: estimators & 95% SCB for the GBM dataset
- Settings: n = 101; p = 5; m = 100; $\tau$ = 0.5
- Expected running time: ~ 20 mins on Intel(R) Core(TM) i7-8700 CPU

table2.m: estimation performance from simulated datasets
- Settings: p = 2; N = 500; nsimu = 200 (number of simulated datasets);
- Because averaging the estimation performance over 200 simulated datasets is time-consuming, Table 2 is split into three parts, one for each combination of sample size n and number of grids m:
  - part 1: n = 100; m = 100;
  - part 2: n = 100; m = 200;
  - part 3: n = 200; m = 100;
- After each part, the procedure pauses and the estimation results are displayed in the MATLAB command window; press "Enter" to continue with the next part.
- tau_set: targeted quantile level(s), taking values from (0.1, 0.3, 0.5, 0.7, 0.9)
  - In each part, the proposed method is run with $\tau = 0.5$.
  - Expected running time with tau_set = 0.5: ~ 10 hrs using a parallel pool with 6 workers on Intel(R) Core(TM) i7-8700 CPU
  - Note: to obtain results for all the quantile levels, i.e., tau_set = (0.1, 0.3, 0.5, 0.7, 0.9), please uncomment lines 13, 24, and 35, one for each part.
- SIVC = true: compare with SIVC (2016) on the generated dataset
  - See the details in simu_main.m.

To compare with PWSI, run the following commands after running "table2.m":

    # part 1: n=100, p=2, m=100, N=500, nsimu=200
    Rscript simu_PWSI.R 100 2 100 500 200
    # part 2: n=100, p=2, m=200, N=500, nsimu=200
    Rscript simu_PWSI.R 100 2 200 500 200
    # part 3: n=200, p=2, m=100, N=500, nsimu=200
    Rscript simu_PWSI.R 200 2 100 500 200

A table with the following 14 columns: "n", "N", "ISE_f_mean", "ISE_f_std", "m", "tau", "ISE_beta1_mean", "ISE_beta1_std", "ISE_beta2_mean", "ISE_beta2_std", "ISE_g_mean", "ISE_g_std", "ISE_Psi^{-1}(g)_mean", "ISE_Psi^{-1}(g)_std"

simu_main.m: the function for calculating the estimation errors from simulated datasets, measured by the mean and standard deviation (std) of the ISE.
For different simulation settings, please modify the corresponding parameters.
- n: sample size, choose from [100,200]
- p: number of covariates, p = 2
- m: number of grids for lqd functions, choose from [100,200]
- N: number of data points in each sample, choose from [500,1000]
- nsimu: number of simulated datasets, nsimu = 200
- tau_set: targeted quantile levels, (0.1, 0.3, 0.5, 0.7, 0.9)
- SIVC: logical value, true or false, whether to get estimators by SIVC (2016)
Step 2. Get the estimators by our method, and calculate the mean & std of integrated squared errors (ISE) of the estimators;
(i) Extract the density estimators & LQD representations from the generated images;
(ii) Get the estimators of the functional coefficients & link function given different quantile levels, (0.1, 0.3, 0.5, 0.7, 0.9);
(iii) Calculate the mean & std of ISE of the estimators and display the errors in a table.
Step 3. Get the estimators by SIVC (2016) based on extracted LQD representations, and calculate the mean & std of ISE of the estimators.
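
The mean & std of ISE used in the steps above can be illustrated with a minimal MATLAB sketch. It assumes that the ISE of a curve is the integral of the squared difference between the estimate and the truth, approximated by the trapezoidal rule on an equally spaced grid. The grid, the true function, and the "estimates" are synthetic placeholders and are not outputs of simu_main.m.

```matlab
% Minimal sketch (assumption: ISE = integral over s of (est(s) - truth(s))^2,
% approximated with the trapezoidal rule on an equally spaced grid).
m     = 100;                     % number of grid points
nsimu = 200;                     % number of simulated datasets
s     = linspace(0, 1, m);       % hypothetical grid on [0, 1]
beta_true = sin(2 * pi * s);     % hypothetical true coefficient function

rng(1);                          % reproducible placeholder "estimates"
beta_est = repmat(beta_true, nsimu, 1) + 0.05 * randn(nsimu, m);

% One ISE value per simulated dataset (integrate along the grid dimension)
ise = trapz(s, (beta_est - repmat(beta_true, nsimu, 1)).^2, 2);

ISE_mean = mean(ise);            % mean of ISE over the simulated datasets
ISE_std  = std(ise);             % std of ISE over the simulated datasets
fprintf('ISE mean = %.4g, std = %.4g\n', ISE_mean, ISE_std);
```

Each row of `beta_est` plays the role of one simulated dataset, so `ISE_mean` and `ISE_std` correspond to one "mean"/"std" column pair of the results table.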
Settings: n=100, p=2, m=100, N=500, nsimu=200, tau_set = (0.1, 0.3, 0.5, 0.7, 0.9).

Run the command through the MATLAB command prompt:

    [T_all, all_betaest, all_gest, all_dgest, all_gest_inv, ...
        all_betaest_SIVC, all_gest_SIVC, all_gest_inv_SIVC] = ...
        simu_main(100, 2, 100, 500, 200, [0.1, 0.3, 0.5, 0.7, 0.9], true);

Run the R script "simu_PWSI.R" using the command line:

    Rscript simu_PWSI.R 100 2 100 500 200

DSQRM.m: the main function of the workflow shown in Fig 1.
Input:
- x: a set of covariates of interest
- v: brain tumor images with pixel intensities
- m: the number of grids for the measurement of density estimators extracted from the images
- tau_set: a set of targeted quantile levels
- For other optional arguments, see the details in DSQRM.m

Output:
- fhat: estimated densities, (n, N)
- f_support: support of estimated densities, (n, N)
- hf: bandwidth for density estimators, (1, n)
- ally: LQD representation of fhat, (n, m)
- all_betaest: estimated functional coefficients at targeted quantile levels, (p, m, ntau)
- all_gest: estimated link function at targeted quantile levels, (n, m, ntau)
- all_dgest: estimated first derivative of the link function at targeted quantile levels, (n, m, ntau)
- all_gest_inv: the inverse transformation of the estimated gest, (n, m, ntau)

Optional:
- all_Cb_beta: simultaneous confidence bands for estimated coefficients beta_l(s), l=1,...,p; (p, ntau)
- all_Cr_beta: simultaneous confidence region for estimated coefficient functions beta(s), (1, ntau)
- all_Cb_g: simultaneous confidence band for estimated link function gest(\cdot), (1, ntau)
- all_pvals: p-values of the hypothesis testing procedures at the targeted quantile levels, (1, ntau)

Choose a .csv or .mat file containing the table of covariates of interest, with the corresponding variable names (not including the intercept).
- .mat file: contains a variable named "xdesign", which is a table of size (n, p0), with the first column as the sample name and the column names as the variable names;
  - n: sample size
  - p0: the number of covariates
- .csv file: the first column is the sample name, and the remaining columns are the variables treated as the covariates of interest.

> [!NOTE]
> The continuous variables need to be normalized before being loaded.
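
As a hedged illustration of the covariate file format, the sketch below builds a small table and saves it in both accepted formats. The sample names, variable names, values, and output file names are hypothetical, and "normalized" is assumed here to mean z-scoring (zero mean, unit variance); the software may expect a different normalization.

```matlab
% Hypothetical example: build a covariate table and save it in both formats.
SampleName = {'S001'; 'S002'; 'S003'; 'S004'};   % hypothetical sample names
Age        = [45; 60; 52; 71];                   % continuous covariate
Sex        = [0; 1; 1; 0];                       % binary covariate, left as-is

% Assumption: "normalized" means z-scoring the continuous covariates.
Age = (Age - mean(Age)) / std(Age);

% First column: sample name; remaining columns: covariates (no intercept).
xdesign = table(SampleName, Age, Sex);           % size (n, 1 + p0)

save('covariates.mat', 'xdesign');               % .mat input, variable "xdesign"
writetable(xdesign, 'covariates.csv');           % .csv input with header row
```

The column names of `xdesign` are taken from the workspace variable names, so they double as the variable names required by the file format.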
Click the button Load Image Data to choose the images to be analyzed; the format of this input depends on the choice of "Input Type of Images".
- whole images (default): 3D or 2D images of the whole brain, with the corresponding masks for tumor regions.
  - a folder containing the images & corresponding tumor masks;
    - name of image file: SampleName + "image" + ".nii" / ".nii.gz"
    - name of mask file: SampleName + "mask" + ".nii" / ".nii.gz"
  - ntype: the number of tumor subtypes, which is the number of unique nonzero values in a mask.
- image pixels: extracted pixels of the tumor region for each sample.
  - a MATLAB data file (.mat) containing the pixels extracted from the tumor region (named "tumor_pixels") & the ratios of each tumor subtype (named "sub_tumor_ratios");
    - tumor_pixels: either
      - a cell array of length n, where each cell contains the pixels extracted from the tumor region; or
      - a matrix of size (n, N), if the number of pixels in the tumor region is the same for all samples;
    - sub_tumor_ratios: a matrix of size (n, ntype), where each row should sum to 1.

The final design matrix combines the covariates of interest & the tumor subtype ratios; it is a matrix of size (n, p), where p = p0 + ntype - 1.
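
The relation p = p0 + ntype - 1 can be checked on a toy mask. This is a minimal sketch for a single sample: the mask, the covariate values, and the choice of dropping the last ratio column are assumptions for illustration only; which subtype the software treats as the reference is not stated here.

```matlab
% Hypothetical 2D mask: 0 = background, nonzero values = tumor subtypes.
mask = [0 1 1 2;
        0 1 2 2;
        0 0 4 4];
labels = unique(mask(mask > 0));          % subtype labels found in this mask
ntype  = numel(labels);                   % number of unique nonzero values

% Ratio of each subtype among all tumor pixels (the ratios sum to 1).
counts = arrayfun(@(k) sum(mask(:) == k), labels);
ratios = counts(:)' / sum(counts);        % row vector of size (1, ntype)

% Combine with the p0 covariates of this sample.
x0 = [0.3, 1];                            % hypothetical covariates, p0 = 2
% Assumption: one ratio is dropped because the ratios sum to 1;
% dropping the LAST one is an arbitrary choice made for this sketch.
xrow = [x0, ratios(1:end-1)];             % size (1, p0 + ntype - 1)

fprintf('ntype = %d, p = %d\n', ntype, numel(xrow));
```

Stacking one such row per sample gives a design matrix of size (n, p); one ratio column is redundant because each row of ratios sums to 1.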
- Number of Grids: m, the number of grids for the measurement of log-quantile density transformation.
- Target Quantile: $\tau$, a scalar within the range (0,1), denoting the target quantile level of the quantile regression.
Optional Input

Initials

Click the button Initials to choose a MATLAB data file (.mat) containing the initial values for the functional coefficients $\beta(s)$, the link function $g(\cdot)$, and its first derivative $\dot{g}(\cdot)$.
The variable names should be
- beta0: a matrix of size (p, m)
- g0: a matrix of size (n, m)
- dg0: a matrix of size (n, m)
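
A file with the required variable names and sizes can be created as in the sketch below. The sizes follow the GBM settings listed for figure4.m (n = 101, p = 5, m = 100); the values themselves are arbitrary placeholders that only demonstrate the expected shapes, and the file name is hypothetical. Sensible initial values should come from a preliminary fit rather than from these constants.

```matlab
% Sizes taken from the GBM example settings (n = 101, p = 5, m = 100).
n = 101; p = 5; m = 100;

% Placeholder values only: they show the required shapes, not a good start.
beta0 = ones(p, m) / sqrt(p);    % initial functional coefficients, (p, m)
g0    = zeros(n, m);             % initial link-function values, (n, m)
dg0   = ones(n, m);              % initial first-derivative values, (n, m)

% Hypothetical file name; the variable names must be beta0, g0, dg0.
save('my_initials.mat', 'beta0', 'g0', 'dg0');
```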

Bandwidth

Numerical values within (0,1) controlling the smoothness: $h_\beta$, $h_g$, $h_\eta$
Click the button Output Folder to choose a folder where the extracted distributional representations & the results of our model will be saved.
The results will be saved in this folder by the name "results.mat".
Click the button Extract Distributional Representation to visualize:
- the whole brain with tumor segmentation,
- the extracted density functions,
- the log-quantile density functions.

The sliders can be adjusted to visualize the representations of other samples.

Procedures
- For whole images:
  - load the image & mask files for each sample;
  - extract the tumor pixels & the ratios of each subtype;
  - obtain the density & log-quantile density (LQD) according to the extracted pixels;
  - display the original image of the whole brain, with the annotation of tumor subtypes;
  - display the extracted densities & LQDs.
- For image pixels:
  - load the tumor pixels & the ratios of each subtype;
  - obtain the density & log-quantile density (LQD) according to the extracted pixels;
  - display the extracted densities & LQDs.
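
The "density & LQD" step can be sketched as follows. This is a minimal illustration and not the code in ./utilities/: it assumes a Gaussian kernel density estimate with a rule-of-thumb bandwidth, and it assumes the usual log-quantile density definition $\psi(t) = -\log f(Q(t))$, where $Q$ is the quantile function of the density $f$. The pixel values are synthetic.

```matlab
rng(2);
pixels = 0.5 + 0.1 * randn(1000, 1);   % hypothetical tumor pixel intensities
N = 1000;                              % number of points on the density grid
m = 100;                               % number of grids for the LQD

% Gaussian kernel density estimate on a grid covering the samples.
% Assumption: rule-of-thumb bandwidth; the software may choose it differently.
h  = 1.06 * std(pixels) * numel(pixels)^(-1/5);
xs = linspace(min(pixels) - 3*h, max(pixels) + 3*h, N);
f  = zeros(1, N);
for i = 1:numel(pixels)
    f = f + exp(-0.5 * ((xs - pixels(i)) / h).^2);
end
f = f / (numel(pixels) * h * sqrt(2*pi));
f = f / trapz(xs, f);                  % renormalize so f integrates to 1

% CDF on the density grid, then the quantile function Q on t in [0, 1].
F = cumtrapz(xs, f);
F = F / F(end);
t = linspace(0, 1, m);
[Fu, iu] = unique(F);                  % interp1 needs distinct sample points
Q = interp1(Fu, xs(iu), t);

% Log-quantile density: psi(t) = -log f(Q(t)).
psi = -log(interp1(xs, f, Q));
```

Here `f` (on `xs`) corresponds to one row of the estimated densities and `psi` to one row of the LQD representation, under the stated assumptions. Because `psi` is an unconstrained function on [0, 1], it is a convenient response for regression.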
- Run: click the button to start the algorithm and find the estimators
- the indicator light turns 🟢 while the program is running
- Pause: click the button to pause the process, and click again to resume the process
- the indicator light turns 🟡 while the program is paused
- note: it takes a while for the program to pause
- Stop: click the button to stop the program
- the indicator light turns 🔴
Click the button to display the fitted functional coefficients and the link function.
- Significance level: default = 0.05
- Number of Bootstrap: default = 200
Click the button to start the bootstrap procedure for constructing the SCB for both the functional coefficients and the link function.
Users can choose whether to display the SCB together with the estimators by checking the box at the bottom of the right panel.
Click the button to start the hypothesis test procedure for one of the functional coefficients
- idx: the index of the covariate on which the hypothesis test is conducted
- p-value: the corresponding p-value after the bootstrap procedures
./Software_DSQRM/examples/: this folder contains both input files and output results of this example.
- images & masks: './examples/TCGA_flair_single_slice'
- covariates of interest: './examples/TCGA_GBM_covariates.csv'
- initial values for the functional coefficients & link function: './examples/initials.mat'
- output results: saved in the MATLAB file './examples/results.mat'

Fig 3. An example of the analysis on GBM dataset using the software.
