Summary
When gene_cnv_frequencies_advanced is called with nobs_mode="fixed", all returned CNV frequencies are exactly half the correct value, because the nobs denominator is multiplied by 2 when it should not be.
The Problem
In cnv_frq.py lines 587-588:
    if nobs_mode == "called":
        nobs[:, cohort_index] = np.repeat(cohort_n_called, 2)
    else:
        assert nobs_mode == "fixed"
        nobs[:, cohort_index] = cohort.size * 2  # ← BUG: should not multiply by 2
The issue: count represents the number of samples with amp/del (lines 580-581):
    count[::2, cohort_index] = np.sum(cohort_is_amp, axis=1)  # sample count
    count[1::2, cohort_index] = np.sum(cohort_is_del, axis=1)  # sample count
So frequency = count / nobs becomes sample_count / (samples × 2), which is exactly half.
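The halving can be reproduced with a small NumPy sketch. The boolean arrays are made-up data for 3 genes and 4 samples, and the variable names only mirror the snippets above; this illustrates the arithmetic and is not the library's code.

```python
import numpy as np

# Hypothetical calls: rows are genes, columns are the samples of one cohort.
cohort_is_amp = np.array([
    [True, True, True, True],      # gene 0: amplified in all 4 samples
    [True, False, False, False],   # gene 1: amplified in 1 sample
    [False, False, False, False],  # gene 2: never amplified
])
cohort_is_del = np.zeros_like(cohort_is_amp)  # no deletions in this toy data

n_genes, n_samples = cohort_is_amp.shape

# Two rows per gene: even rows hold amp counts, odd rows hold del counts.
count = np.zeros(n_genes * 2, dtype=int)
count[::2] = np.sum(cohort_is_amp, axis=1)
count[1::2] = np.sum(cohort_is_del, axis=1)

nobs_doubled = np.full(n_genes * 2, n_samples * 2)  # denominator as reported
nobs_samples = np.full(n_genes * 2, n_samples)      # one observation per sample

freq_doubled = count / nobs_doubled
freq_samples = count / nobs_samples

print(freq_doubled[::2])  # amp frequencies with the doubled denominator
print(freq_samples[::2])  # amp frequencies with the per-sample denominator
```

Gene 0 is amplified in every sample, yet the doubled denominator reports it at 0.5 rather than 1.0.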
Why this is wrong
- The "called" mode proves the bug. It uses:
    nobs[:, cohort_index] = np.repeat(cohort_n_called, 2)
  np.repeat(x, 2) repeats each element twice in place: for x = [10, 20, 30] it produces [10, 10, 20, 20, 30, 30], NOT [20, 40, 60]. The two copies fill the amp row and the del row of each gene, so nobs is the number of called samples, not double it.
- The basic version confirms it. gene_cnv_frequencies (non-advanced) computes:
    frequency = amp_count_coh / called_count_coh  # sample count / sample count ✅
    nobs = called_count_coh  # NO multiply by 2
- CNV calls are per-sample, not per-allele. Unlike SNPs, where each diploid sample contributes 2 alleles, the CNV CN_mode gives one copy number classification per sample.
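The behaviour of np.repeat described in the first point is easy to check directly. The counts below are hypothetical called-sample counts for three genes.

```python
import numpy as np

# Hypothetical number of called samples for each of 3 genes.
cohort_n_called = np.array([10, 20, 30])

nobs_called = np.repeat(cohort_n_called, 2)
print(nobs_called)  # [10 10 20 20 30 30]

# Even rows (amp) and odd rows (del) both see the plain per-gene sample count.
assert np.array_equal(nobs_called[::2], cohort_n_called)
assert np.array_equal(nobs_called[1::2], cohort_n_called)
```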
Impact
- All CNV frequencies from gene_cnv_frequencies_advanced(..., nobs_mode="fixed") are exactly half the true value
- Downstream confidence intervals from _add_frequency_ci are artificially narrow (inflated nobs inflates precision)
- Any spatial/temporal/population analyses using these frequencies show systematically deflated CNV rates
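The effect on interval width can be sketched with a hand-written Wilson score interval. This assumes a Wilson-style binomial interval purely for illustration; the report does not say which method _add_frequency_ci uses, and wilson_interval below is a hypothetical helper, not part of the library.

```python
import math


def wilson_interval(count, nobs, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half = z * math.sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return centre - half, centre + half


# Hypothetical cohort: 5 of 10 samples carry the amplification.
count, n_samples = 5, 10

lo_ok, hi_ok = wilson_interval(count, n_samples)        # per-sample nobs
lo_bad, hi_bad = wilson_interval(count, n_samples * 2)  # doubled nobs

print(f"per-sample nobs: {lo_ok:.3f} to {hi_ok:.3f}")
print(f"doubled nobs:    {lo_bad:.3f} to {hi_bad:.3f}")
```

With the doubled denominator the interval is both shifted toward zero and narrower, so the result looks more precise than the data support.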
Example
A cohort of 10 samples, all with amp, nobs_mode="fixed":
- Current (wrong): frequency = 10 / (10 × 2) = 0.50 → suggests 50% frequency
- Correct: frequency = 10 / 10 = 1.0 → suggests 100% frequency (all samples have it)
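One possible regression check is that the two modes agree when every sample has a call, since the number of called samples then equals the cohort size. The helper below is hypothetical: it only mirrors the branch quoted earlier with the multiplication removed, and is not the library's code.

```python
import numpy as np


def nobs_for_cohort(nobs_mode, cohort_size, cohort_n_called):
    """Hypothetical helper: per-row denominators for one cohort (two rows per gene)."""
    n_rows = len(cohort_n_called) * 2
    if nobs_mode == "called":
        return np.repeat(cohort_n_called, 2)
    if nobs_mode == "fixed":
        return np.full(n_rows, cohort_size)  # per-sample, no doubling
    raise ValueError(f"unknown nobs_mode: {nobs_mode!r}")


cohort_size = 10
all_called = np.full(3, cohort_size)  # 3 genes, every sample has a call

fixed = nobs_for_cohort("fixed", cohort_size, all_called)
called = nobs_for_cohort("called", cohort_size, all_called)
assert np.array_equal(fixed, called)

# The example from the text: 10 of 10 samples amplified.
frequency = 10 / fixed[0]
print(frequency)
```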