max_af is NaN when all cohorts are excluded by min_cohort_size #978

## Summary

When every cohort for a given variant has fewer samples than `min_cohort_size`,
all frequency columns (`frq_*`) for that row are set to NaN. In this case,
`max_af` is also NaN, but the documented contract says `max_af` is the
maximum allele frequency across all cohorts. NaN silently violates that
contract and breaks the canonical downstream filter:

```python
df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
```

Variants where every cohort was excluded are silently dropped from this
query (because `NaN >= 0.05` is False), which is correct by accident.

But `NaN < 0.05` is also False, so the complementary filters no longer agree
with each other:

```python
df.query("max_af < 0.05")       # NaN rows are dropped here too
df[~(df["max_af"] >= 0.05)]     # NaN rows are retained: inconsistent
df[df["max_af"].isna()]         # Users don't expect any NaN rows here
```

This also affects any arithmetic on `max_af` (e.g. ranking, plotting) without
the user being warned that some values are missing.
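
The comparison semantics behind these filters can be checked on a toy frame. This is a minimal sketch that assumes only pandas and NumPy; the `toy` frame and its values are made up for illustration and do not come from `malariagen_data`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the allele-frequency table; the last row mimics a
# variant whose cohorts were all excluded.
toy = pd.DataFrame({"max_af": [0.10, 0.01, np.nan]})

at_least = toy.query("max_af >= 0.05")     # keeps only the 0.10 row
below = toy.query("max_af < 0.05")         # keeps only the 0.01 row
negated = toy[~(toy["max_af"] >= 0.05)]    # keeps the 0.01 row and the NaN row

print(len(at_least), len(below), len(negated))  # 1 1 2
```

The NaN row matches neither ordered comparison, so `max_af >= X` and `max_af < X` do not partition the table, while the negated form picks the NaN row up.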

## Steps to Reproduce
```python
import malariagen_data

ag3 = malariagen_data.Ag3()

# Use a very large min_cohort_size so every cohort is excluded
df = ag3.snp_allele_frequencies(
    transcript="AGAP004707-RD",     # Vgsc
    cohorts="admin1_year",
    sample_sets="3.0",
    sample_query="country == 'Ghana'",
    min_cohort_size=10_000,          # No cohort will reach this
    drop_invariant=False,
)

# All frq_* columns are NaN, and so is max_af
print(df["max_af"].isna().all())    # True (unexpected)
print(df["max_af"].dtype)           # float64, no NaN expected by users

# NaN compares False both ways, so the complementary filters disagree
print(len(df.query("max_af < 0.05")))      # NaN rows are dropped
print(len(df[~(df["max_af"] >= 0.05)]))    # NaN rows are retained
```

## Expected Behaviour

`max_af` should be 0.0 (not NaN) when every cohort is excluded by
`min_cohort_size`. Under the documented contract ("the maximum allele frequency
across all cohorts"), the value should be 0.0 when no cohort data is available.
The canonical training-course filter `max_af >= 0.05` should silently discard
these rows (which it does correctly today), but users should not be surprised by
NaN appearing in a column they expect to always hold a numeric frequency.

## Actual Behaviour

`max_af` is NaN for any variant where every cohort column is NaN.

## Root Cause

In `malariagen_data/anoph/snp_frq.py`, `max_af` is computed as:

```python
df["max_af"] = df[frq_cols].max(axis=1)
```

`pandas.DataFrame.max(axis=1)` returns NaN when all values in a row are
NaN (i.e. all cohorts excluded). The fix is a single `.fillna(0)`:

```python
df["max_af"] = df[frq_cols].max(axis=1).fillna(0)
```

The same pattern likely applies to `aa_allele_frequencies()` and
`hap_allele_frequencies()`, which compute `max_af` in the same way.

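
The row-wise behaviour can be reproduced without downloading any data. This is a minimal sketch on a made-up frame; the column names `frq_cohort_a` and `frq_cohort_b` are hypothetical placeholders, not real cohort labels:

```python
import numpy as np
import pandas as pd

# Hypothetical cohort frequency columns. The second row mimics a variant
# whose cohorts were all excluded; the third has one cohort excluded.
toy = pd.DataFrame(
    {
        "frq_cohort_a": [0.20, np.nan, np.nan],
        "frq_cohort_b": [0.05, np.nan, 0.30],
    }
)
frq_cols = [col for col in toy.columns if col.startswith("frq_")]

current = toy[frq_cols].max(axis=1)              # all-NaN row stays NaN
proposed = toy[frq_cols].max(axis=1).fillna(0)   # all-NaN row becomes 0.0

print(current.tolist())    # [0.2, nan, 0.3]
print(proposed.tolist())   # [0.2, 0.0, 0.3]
```

Rows with at least one non-NaN cohort are unaffected, because `max(axis=1)` skips NaN by default; only the all-NaN row changes under `.fillna(0)`.
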
## Impact

- Inconsistent results for users who treat `max_af < X` and `~(max_af >= X)` as
  the same filter: NaN rows are dropped by the first and kept by the second
- NaN in `max_af` surprises users who plot or rank frequencies
- Inconsistency: `drop_invariant=True` (the default) hides the issue, but
  `drop_invariant=False` exposes it
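
The ranking point can be illustrated with a small series. This is a sketch with made-up values, assuming pandas defaults (`na_option="keep"` for `rank`):

```python
import numpy as np
import pandas as pd

# Made-up max_af values; the middle entry mimics an all-cohorts-excluded variant.
max_af = pd.Series([0.10, np.nan, 0.30], name="max_af")

ranks = max_af.rank(ascending=False)   # the NaN entry receives no rank
n_used = max_af.count()                # count() silently ignores NaN

print(ranks.tolist())        # [2.0, nan, 1.0]
print(n_used, len(max_af))   # 2 3
```

Neither call raises or warns, so a missing value only shows up if the user checks for it explicitly.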

## Proposed Fix

See the linked PR.
