max_af is NaN when all cohorts are excluded by min_cohort_size #978

@blankirigaya

Summary

When every cohort for a given variant has fewer samples than min_cohort_size,
all frequency columns (frq_*) for that row are set to NaN. In this case,
max_af is also NaN, even though the documented contract says max_af is the
maximum allele frequency across all cohorts. NaN silently violates that
contract and makes filters on max_af unreliable. The canonical downstream filter is:

```python
df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
```
Variants where every cohort was excluded are silently dropped from this
query (because NaN >= 0.05 is False), which is correct by accident.

But the complementary filters disagree about the NaN rows:
```python
df.query("max_af < 0.05")      # NaN rows are dropped here too (NaN < 0.05 is False)
df[~(df["max_af"] >= 0.05)]    # NaN rows are retained, which is wrong
df[df["max_af"].isna()]        # Users don't expect any NaN rows here
```
NaN also propagates through arithmetic on max_af (e.g. ranking, plotting) without
the user being warned that some values are missing.
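
How NaN interacts with each form of the threshold filter can be checked without downloading any data. The following is a minimal sketch on a toy frame; the `toy` DataFrame and its three values are made up for illustration and are not output from `snp_allele_frequencies()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the result frame: one variant above the threshold,
# one below it, and one where every cohort was excluded (NaN).
toy = pd.DataFrame({"max_af": [0.20, 0.01, np.nan]})

kept_ge = toy.query("max_af >= 0.05")        # NaN >= 0.05 is False: NaN row dropped
kept_lt = toy.query("max_af < 0.05")         # NaN < 0.05 is False: NaN row dropped
kept_not_ge = toy[~(toy["max_af"] >= 0.05)]  # negation of False is True: NaN row kept

print(len(kept_ge), len(kept_lt), len(kept_not_ge))
```

The two "below threshold" forms are logically equivalent for ordinary numbers but return different row counts once NaN is present.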

Steps to Reproduce

```python
import malariagen_data

ag3 = malariagen_data.Ag3()

# Use a very large min_cohort_size so every cohort is excluded
df = ag3.snp_allele_frequencies(
    transcript="AGAP004707-RD",  # Vgsc
    cohorts="admin1_year",
    sample_sets="3.0",
    sample_query="country == 'Ghana'",
    min_cohort_size=10_000,  # No cohort will reach this
    drop_invariant=False,
)

# All frq_* columns are NaN, and so is max_af
print(df["max_af"].isna().all())  # True (unexpected)
print(df["max_af"].dtype)  # float64, no NaN expected by users

# The negated filter silently includes these rows
print(len(df[~(df["max_af"] >= 0.05)]))  # includes NaN rows (wrong)
```

Expected Behaviour

max_af should be 0.0 (not NaN) when every cohort is excluded by
min_cohort_size. Under the documented contract ("the maximum allele frequency
across all cohorts"), a variant with no cohort data should report 0.0, not NaN.
The canonical training-course filter max_af >= 0.05 should still discard
these rows (which it does today), and users should not be surprised by
NaN appearing in a column they expect to always hold a numeric frequency.

Actual Behaviour

max_af is NaN for any variant where every cohort column is NaN.

Root Cause

In malariagen_data/anoph/snp_frq.py, max_af is computed as:
```python
df["max_af"] = df[frq_cols].max(axis=1)
```
pandas.DataFrame.max(axis=1) returns NaN when all values in a row are
NaN (i.e. all cohorts excluded). The fix is a single .fillna(0):
```python
df["max_af"] = df[frq_cols].max(axis=1).fillna(0)
```
The same pattern likely applies to aa_allele_frequencies() and
hap_allele_frequencies(), which compute max_af in the same way.
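
The row-wise behaviour described above can be reproduced on synthetic data. This is only a sketch: the column names `frq_cohort_a` and `frq_cohort_b` are hypothetical placeholders, and the code mirrors the one-line computation quoted above rather than the actual contents of snp_frq.py.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort columns; real names depend on the cohorts requested.
df_toy = pd.DataFrame(
    {
        "frq_cohort_a": [0.10, np.nan, np.nan],
        "frq_cohort_b": [0.30, 0.02, np.nan],
    }
)
frq_cols = [c for c in df_toy.columns if c.startswith("frq_")]

# Current computation: NaN is skipped within a row, but a row that is
# entirely NaN has nothing left to take the maximum of and yields NaN.
current = df_toy[frq_cols].max(axis=1)

# Proposed computation: the all-NaN row becomes 0.0, other rows are unchanged.
proposed = df_toy[frq_cols].max(axis=1).fillna(0)

print(current.tolist())
print(proposed.tolist())
```

Rows with at least one surviving cohort are unaffected, so the change only touches variants for which every cohort was excluded.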
Impact

- Inconsistent results for users filtering on max_af: max_af < X drops the NaN rows, while the logically equivalent ~(max_af >= X) keeps them
- NaN in max_af surprises users who plot or rank frequencies
- Inconsistency: drop_invariant=True (the default) hides the issue, but
  drop_invariant=False exposes it
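
Until max_af is guaranteed to be numeric, one way for callers to avoid these pitfalls is to separate the missing rows explicitly before filtering. The helper below, `split_by_max_af`, is a hypothetical name introduced here for illustration and is not part of malariagen_data; the toy frame is likewise made up.

```python
import numpy as np
import pandas as pd


def split_by_max_af(df, threshold):
    """Partition rows into (at_or_above, below, missing) by max_af.

    Hypothetical helper: each row lands in exactly one partition,
    so rows with NaN max_af are never silently mixed into a filter.
    """
    missing = df["max_af"].isna()
    at_or_above = df["max_af"] >= threshold  # False for NaN
    below = df["max_af"] < threshold         # False for NaN
    return df[at_or_above], df[below], df[missing]


toy = pd.DataFrame({"max_af": [0.20, 0.01, np.nan]})
above, below, missing = split_by_max_af(toy, 0.05)
print(len(above), len(below), len(missing))
```

Reporting the size of the `missing` partition makes the excluded-cohort case visible instead of leaving it to depend on which form of the comparison was written.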

Proposed Fix
See the linked PR.
