CNV frequencies halved in gene_cnv_frequencies_advanced with nobs_mode="fixed" #1019

@rehanxt5

Summary

When gene_cnv_frequencies_advanced is called with nobs_mode="fixed", all returned CNV frequencies are exactly half the correct value. The nobs denominator is multiplied by 2 when it should not be.

The Problem

In cnv_frq.py lines 587-588:

if nobs_mode == "called":
    nobs[:, cohort_index] = np.repeat(cohort_n_called, 2)
else:
    assert nobs_mode == "fixed"
    nobs[:, cohort_index] = cohort.size * 2  # ← BUG: should not multiply by 2

The issue: count holds the number of samples with an amp/del (lines 580-581):

count[::2, cohort_index] = np.sum(cohort_is_amp, axis=1)   # sample count
count[1::2, cohort_index] = np.sum(cohort_is_del, axis=1)  # sample count

So frequency = count / nobs becomes sample_count / (samples × 2) — exactly half.
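The halving can be reproduced outside the library with a small synthetic cohort. This is only a sketch of the arithmetic: the boolean matrices below are made-up data of shape (genes, samples), and the variable names `nobs_current` and `nobs_per_sample` are illustrative, not names from cnv_frq.py.

```python
import numpy as np

# Hypothetical cohort: 2 genes x 4 samples, True where a sample carries the CNV.
cohort_is_amp = np.array([[True, True, True, True],
                          [True, False, False, False]])
cohort_is_del = np.array([[False, False, False, False],
                          [False, True, True, False]])

n_genes, n_samples = cohort_is_amp.shape

# Rows alternate amp/del per gene, mirroring the layout quoted above.
count = np.zeros(n_genes * 2, dtype=int)
count[::2] = np.sum(cohort_is_amp, axis=1)   # samples with an amplification
count[1::2] = np.sum(cohort_is_del, axis=1)  # samples with a deletion

# Denominator as reported in this issue (samples x 2) versus a per-sample denominator.
nobs_current = np.full(n_genes * 2, n_samples * 2)
nobs_per_sample = np.full(n_genes * 2, n_samples)

freq_current = count / nobs_current
freq_per_sample = count / nobs_per_sample

print(freq_current)
print(freq_per_sample)
```

With these inputs every entry of `freq_current` is half the corresponding entry of `freq_per_sample`, including the first gene, where all four samples carry the amplification.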

Why this is wrong

  1. The "called" mode proves the bug. It uses:

    nobs[:, cohort_index] = np.repeat(cohort_n_called, 2)

    np.repeat(x, 2) repeats each element twice in place: for x = [10, 20, 30] it produces [10, 10, 20, 20, 30, 30], NOT [20, 40, 60]. So nobs is the number of called samples, once for the amp row and once for the del row, not doubled.

  2. The basic version confirms it. gene_cnv_frequencies (non-advanced) computes:

    frequency = amp_count_coh / called_count_coh  # sample count / sample count ✅
    nobs = called_count_coh                        # NO multiply by 2

  3. CNV calls are per-sample, not per-allele. Unlike SNPs where each diploid sample contributes 2 alleles, CNV CN_mode gives one copy number classification per sample.
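The np.repeat behaviour in point 1 is easy to check in isolation. A minimal sketch, assuming hypothetical per-gene called-sample counts of 10, 20 and 30:

```python
import numpy as np

# Hypothetical number of called samples per gene.
cohort_n_called = np.array([10, 20, 30])

# What the "called" branch assigns: each value repeated for the amp and del rows.
nobs_called = np.repeat(cohort_n_called, 2)

# What a doubled denominator would look like instead.
doubled = cohort_n_called * 2

print(nobs_called.tolist())  # [10, 10, 20, 20, 30, 30]
print(doubled.tolist())      # [20, 40, 60]

# The amp rows (even indices) and del rows (odd indices) both see the plain sample count.
print(np.array_equal(nobs_called[::2], cohort_n_called))   # True
print(np.array_equal(nobs_called[1::2], cohort_n_called))  # True
```

The repeat only stretches the per-gene vector to match the two-rows-per-gene layout of count; it does not change the scale of the denominator.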

Impact

  • All CNV frequencies from gene_cnv_frequencies_advanced(..., nobs_mode="fixed") are exactly half the true value
  • Downstream confidence intervals from _add_frequency_ci are artificially narrow (inflated nobs inflates precision)
  • Any spatial/temporal/population analyses using these frequencies show systematically deflated CNV rates

Example

A cohort of 10 samples, all 10 carrying an amp, with nobs_mode="fixed":

  • Current (wrong): frequency = 10 / (10 × 2) = 0.50 → suggests 50% frequency
  • Correct: frequency = 10 / 10 = 1.0 → suggests 100% frequency (all samples have it)
