Description
Summary
When every cohort for a given variant has fewer samples than min_cohort_size,
all frequency columns (frq_*) for that row are set to NaN. In this case,
max_af is also NaN — but the documented contract says max_af is the
maximum allele frequency across all cohorts. NaN silently violates that
contract and breaks the canonical downstream filter:
```python
df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
```
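Variants where every cohort was excluded are dropped from this query because
NaN >= 0.05 evaluates to False, so the result is correct only by accident.
The comparison max_af < 0.05 is also False for NaN, so it drops these rows too.
The negated form of the filter, however, silently retains them, so two filters
that users expect to be equivalent disagree:
```python
df.query("max_af < 0.05")        # NaN rows are dropped
df[~(df["max_af"] >= 0.05)]      # NaN rows are retained (unexpected)
df[df["max_af"].isna()]          # Users don't expect any NaN rows here
```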
This also breaks any arithmetic on max_af (e.g. ranking, plotting) without
the user being warned that some values are missing.
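The comparison behaviour can be checked on a toy frame, independent of malariagen_data. This is a minimal sketch: the three rows and their index labels are made up for illustration, and only the max_af column name comes from the report.

```python
import numpy as np
import pandas as pd

# Toy frame: a common variant, a rare variant, and a variant whose cohorts
# were all excluded, so its max_af is NaN. Labels are illustrative only.
df = pd.DataFrame(
    {"max_af": [0.30, 0.01, np.nan]},
    index=["common", "rare", "excluded"],
)

# Any ordered comparison with NaN evaluates to False.
kept_ge = df[df["max_af"] >= 0.05].index.tolist()
kept_lt = df[df["max_af"] < 0.05].index.tolist()

# Negating the ">=" mask turns that False into True, so the NaN row is kept.
kept_not_ge = df[~(df["max_af"] >= 0.05)].index.tolist()

print(kept_ge)      # ['common']
print(kept_lt)      # ['rare']
print(kept_not_ge)  # ['rare', 'excluded']
```

The NaN row appears in neither of the direct comparisons, yet it does appear in the negated one, which is why "rare" filters written in the two styles return different row counts.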
Steps to Reproduce
```python
import malariagen_data

ag3 = malariagen_data.Ag3()

# Use a very large min_cohort_size so every cohort is excluded
df = ag3.snp_allele_frequencies(
    transcript="AGAP004707-RD",  # Vgsc
    cohorts="admin1_year",
    sample_sets="3.0",
    sample_query="country == 'Ghana'",
    min_cohort_size=10_000,  # No cohort will reach this
    drop_invariant=False,
)

# All frq_* columns are NaN, and so is max_af
print(df["max_af"].isna().all())  # True (unexpected)
print(df["max_af"].dtype)  # float64, no NaN expected by users

# The negated filter silently includes these rows
print(len(df[~(df["max_af"] >= 0.05)]))  # counts the NaN rows too
```
Expected Behaviour
max_af should be 0.0 (not NaN) when every cohort is excluded by
min_cohort_size. The documented contract — "the maximum allele frequency
across all cohorts" — is 0.0 when no cohort data is available, not NaN.
The canonical training-course filter max_af >= 0.05 should silently discard
these rows (which it does correctly today), but users should not be surprised by
NaN appearing in a column they expect to always hold a numeric frequency.
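The expected contract can be written down as a small check. This is a sketch only: check_max_af_contract is a hypothetical helper, not part of malariagen_data, and the [0, 1] range is an assumption based on max_af being a frequency.

```python
import numpy as np
import pandas as pd


def check_max_af_contract(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if max_af breaks the expected contract."""
    max_af = df["max_af"]
    if max_af.isna().any():
        raise ValueError("max_af contains NaN")
    # Assumption: a frequency always lies in the closed interval [0, 1].
    if not max_af.between(0.0, 1.0).all():
        raise ValueError("max_af outside [0, 1]")


good = pd.DataFrame({"max_af": [0.0, 0.3]})
bad = pd.DataFrame({"max_af": [np.nan, 0.3]})

check_max_af_contract(good)  # passes silently

try:
    check_max_af_contract(bad)
    contract_violated = False
except ValueError:
    contract_violated = True

print(contract_violated)  # True
```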
Actual Behaviour
max_af is NaN for any variant where every cohort column is NaN.
Root Cause
In malariagen_data/anoph/snp_frq.py, max_af is computed as:
```python
df["max_af"] = df[frq_cols].max(axis=1)
```
pandas.DataFrame.max(axis=1) returns NaN when all values in a row are
NaN (i.e. all cohorts excluded). The fix is a single .fillna(0):
```python
df["max_af"] = df[frq_cols].max(axis=1).fillna(0)
```
The same pattern likely applies to aa_allele_frequencies() and
hap_allele_frequencies(), which appear to compute max_af in the same way.
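The row-wise behaviour described above can be reproduced with plain pandas. This is a minimal sketch: the frq_cohort_a and frq_cohort_b column names and their values are invented for illustration and are not taken from snp_frq.py.

```python
import numpy as np
import pandas as pd

# Illustrative frequency table: row 0 has two valid cohorts, row 1 has one,
# and row 2 has none (every cohort fell below min_cohort_size).
frq = pd.DataFrame(
    {
        "frq_cohort_a": [0.10, np.nan, np.nan],
        "frq_cohort_b": [0.25, 0.02, np.nan],
    }
)
frq_cols = [c for c in frq.columns if c.startswith("frq_")]

# max(axis=1) skips NaN by default, but an all-NaN row still yields NaN.
raw_max = frq[frq_cols].max(axis=1)

# fillna(0) only changes the all-NaN row; the other rows are untouched.
fixed_max = raw_max.fillna(0)

print(raw_max.tolist())    # [0.25, 0.02, nan]
print(fixed_max.tolist())  # [0.25, 0.02, 0.0]
```

Row 1 shows that a partially missing row is already handled by the default skipna behaviour, so the fillna(0) only affects rows in which every cohort is missing.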
Impact
- Silently wrong results for users who write the negated filter ~(max_af >= X), which retains NaN rows, whereas max_af < X drops them.
- NaN in max_af surprises users who plot or rank frequencies.
- Inconsistency: drop_invariant=True (the default) hides the issue, while drop_invariant=False exposes it.
Proposed Fix
See the linked PR.