
Analysis MergeNode optimisations #1619

Description

@davmlaw

🤖 Written by Claude

Overview

When multiple SampleNode/CohortNode arms feed into a MergeNode, and those arms are over the same cohort (same CohortGenotypeCollection), the merge currently runs each arm as a separate full pass over the cohortgenotype partition, materialises each arm's PKs into a Python list, and combines them as Variant WHERE id IN (listA) OR id IN (listB).

The arms already share one join alias (cohortgenotype_<cgc_pk>), so in principle the same result could be produced in a single pass — exactly what a CohortNode does (one join, one OR/regex predicate over samples_zygosity). The duplicate passes are the optimisation target.
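
To make the cost difference concrete, here is a minimal pure-Python sketch, not the project's code. It models a cohortgenotype partition as a dict of variant PK to packed `samples_zygosity` string, then compares the per-arm scans with one combined scan. The data, the zygosity letters, and the names `PARTITION`, `merge_two_pass` and `merge_single_pass` are all illustrative assumptions.

```python
# Toy partition: variant PK -> packed samples_zygosity (one character per sample).
# The letters are placeholders for zygosity codes, not taken from the real schema.
PARTITION = {1: "ER", 2: "RE", 3: "OO", 4: "RR"}

# Each arm is (sample index within the packed string, wanted zygosity codes).
ARMS = [(0, {"E", "O"}), (1, {"E", "O"})]


def merge_two_pass(partition, arms):
    """Current behaviour, modelled: one full scan per arm, then union the PK lists."""
    rows_scanned = 0
    pk_lists = []
    for sample_index, wanted in arms:
        pks = []
        for pk, packed in partition.items():
            rows_scanned += 1
            if packed[sample_index] in wanted:
                pks.append(pk)
        pk_lists.append(pks)  # materialised PK list for this arm
    # Stand-in for: Variant WHERE id IN (listA) OR id IN (listB)
    merged = sorted(set().union(*pk_lists))
    return merged, rows_scanned


def merge_single_pass(partition, arms):
    """Target behaviour, modelled: one scan with an OR predicate across arms."""
    rows_scanned = 0
    merged = []
    for pk, packed in partition.items():
        rows_scanned += 1
        if any(packed[i] in wanted for i, wanted in arms):
            merged.append(pk)
    return sorted(merged), rows_scanned
```

Both functions return the same PKs for the same arms; the toy model differs only in how many partition rows it touches, which grows with the number of arms in the two-pass version.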

Why it happens

arg_q_dict expresses AND-of-keyed-filters, applied one annotation alias at a time (AnalysisNode.get_queryset). A merge needs OR across arms, and:

  • Each sample's zygosity filter is keyed under a per-sample alias (sample_<pk>, a Substr over the shared join), so distinct arms never share a key.
  • The format has no representation for "OR across different annotation-dependent keys".

So MergeNode._split_common_filters can only OR-combine None-keyed (annotation-independent) filters; any arm with a non-None key falls back to parent.get_queryset() → materialise PKs → Q(pk__in=...).
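
The split described above can be modelled roughly as follows. This is a simplified sketch under the assumption that each arm's `arg_q_dict` is a mapping from annotation alias (or `None`) to a collection of filters; `split_arms` is a hypothetical helper, not the actual `_split_common_filters` implementation, and the filter strings are placeholders.

```python
def split_arms(arm_q_dicts):
    """Partition arm indexes into (OR-combinable, needs-materialisation).

    An arm is OR-combinable only if every key in its dict is None,
    i.e. all of its filters are annotation-independent.
    """
    combinable, fallback = [], []
    for index, q_dict in enumerate(arm_q_dicts):
        if all(key is None for key in q_dict):
            combinable.append(index)
        else:
            fallback.append(index)
    return combinable, fallback


# Placeholder filters; real values would be Q objects.
arms = [
    {None: ["locus filter"]},                         # annotation-independent
    {"sample_11": ["zygosity in (E, O)"]},            # per-sample alias
    {None: ["gene list"], "sample_12": ["zygosity in (E, O)"]},
]

combinable, fallback = split_arms(arms)
# The two sample arms have different alias keys, so there is no shared key to OR under.
shared_keys = (set(arms[1]) & set(arms[2])) - {None}
```

In this model the two sample arms land in `fallback` even though both aliases would be computed from the same join, which is the gap this issue describes.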

Notes / scope

  • Applies to ancestors, not just direct parents. arg_q_dict and its stable aliases propagate up through intermediate filter nodes unchanged, so the same-cohort arms are detectable however deep the SampleNode sits below the merge. This is why a fix keyed on the shared annotation join generalises, whereas a fix keyed on "are my parents SampleNodes" would miss the (common) case of filtering on each arm before merging.
  • Only large arms matter. Arms resolving to ≤ ANALYSIS_NODE_STORE_ID_SIZE_MAX (default 1000) are already collapsed to literal PK lists upstream (get_small_parent_arg_q_dict), so they neither benefit nor need the rewrite. The saving lands on large arms — which is also where the duplicate passes hurt most.
  • Gate on shared join, not blanket OR. Combining into one pass is only a win when arms share the same cohortgenotype join. Cross-cohort arms would multiply LEFT JOINs under the OR and likely regress — they should stay on the current materialise-then-OR path (the existing worst-case-protection behaviour).
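
One way to picture the scoping rules in the notes above is a small planner that sorts arms into three buckets. This is a sketch under assumptions: the `Arm` dataclass, `plan_merge`, and the idea of having a row-count estimate per arm are hypothetical, and `STORE_ID_SIZE_MAX` simply copies the default quoted for `ANALYSIS_NODE_STORE_ID_SIZE_MAX`.

```python
from collections import defaultdict
from dataclasses import dataclass

STORE_ID_SIZE_MAX = 1000  # default quoted above for ANALYSIS_NODE_STORE_ID_SIZE_MAX


@dataclass(frozen=True)
class Arm:
    name: str
    join_alias: str       # e.g. "cohortgenotype_7" (shared by arms over one cohort)
    estimated_count: int  # hypothetical size estimate for the arm's result


def plan_merge(arms):
    """Decide, per arm, which merge path it would take."""
    plan = {"literal_pks": [], "single_pass": {}, "materialise": []}
    large_by_join = defaultdict(list)
    for arm in arms:
        if arm.estimated_count <= STORE_ID_SIZE_MAX:
            plan["literal_pks"].append(arm.name)      # already collapsed upstream
        else:
            large_by_join[arm.join_alias].append(arm.name)
    for alias, names in large_by_join.items():
        if len(names) >= 2:
            plan["single_pass"][alias] = names         # shared join: one pass
        else:
            plan["materialise"].append(names[0])       # lone join: current path
    return plan


plan = plan_merge([
    Arm("A", "cohortgenotype_7", 50_000),
    Arm("B", "cohortgenotype_7", 80_000),
    Arm("C", "cohortgenotype_9", 20_000),
    Arm("D", "cohortgenotype_7", 200),
])
```

Here only A and B qualify for the combined pass: D is small enough to be a literal PK list already, and C sits on a different join so it stays on the materialise-then-OR path.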

Possible approaches

  1. Special-case same-cohort merge: detect arms over one CohortGenotypeCollection and rebuild a cohort-style single-pass query. Simple-ish, but it adds a parallel query path that must re-handle quality/AF/gene-list/VCF/sub-cohort filters, and it breaks on intermediate filter nodes if keyed on node class.
  2. General filter rewriting (preferred) — extend the merge/get_queryset protocol so an OR-Q can reference multiple already-annotated aliases that share a join: annotate them all up front, then apply a single OR .filter(). Naturally handles ancestors and degrades safely to today's behaviour for distinct joins.
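
Approach 2 could look something like the following in outline. This is a pure-Python model rather than Django code: annotations are plain callables keyed by alias, predicates read the annotated row, and returning `None` stands in for "fall back to today's behaviour". The arm dict layout and the names `combine_arms` and `run_single_pass` are assumptions made for illustration only.

```python
def combine_arms(arms):
    """Return (annotations, or_predicate) for one pass, or None to fall back."""
    joins = {arm["join"] for arm in arms}
    if len(joins) != 1:
        return None  # distinct joins: keep the materialise-then-OR path
    annotations = {}
    for arm in arms:
        # Aliases are stable, so the same key is assumed to mean the same expression.
        annotations.update(arm["annotations"])
    predicates = [arm["predicate"] for arm in arms]
    return annotations, lambda row: any(p(row) for p in predicates)


def run_single_pass(rows, annotations, predicate):
    """Annotate every alias up front, then apply the single OR filter."""
    matched = []
    for pk, row in rows.items():
        annotated = dict(row)
        for alias, expr in annotations.items():
            annotated[alias] = expr(row)
        if predicate(annotated):
            matched.append(pk)
    return matched


def sample_arm(join, alias, index, wanted):
    """Build a toy arm: alias = one character of the packed zygosity string."""
    return {
        "join": join,
        "annotations": {alias: lambda row: row["samples_zygosity"][index]},
        "predicate": lambda row: row[alias] in wanted,
    }


ROWS = {pk: {"samples_zygosity": z} for pk, z in [(1, "ER"), (2, "RE"), (3, "OO"), (4, "RR")]}
arm_a = sample_arm("cohortgenotype_7", "sample_11", 0, {"E", "O"})
arm_b = sample_arm("cohortgenotype_7", "sample_12", 1, {"E", "O"})
arm_c = sample_arm("cohortgenotype_9", "sample_30", 0, {"E", "O"})
```

With `arm_a` and `arm_b` the combined predicate runs in one pass over `ROWS`; pairing `arm_a` with `arm_c` returns `None`, which models the safe degradation for distinct joins.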

Relevant code

  • analysis/models/nodes/filters/merge_node.py: _split_common_filters, _get_merged_q_dict
  • analysis/models/nodes/analysis_node.py: get_queryset (annotate-then-filter loop), get_arg_q_dict, get_small_parent_arg_q_dict
  • analysis/models/nodes/cohort_mixin.py, snpdb/models/models_vcf.py (Sample.get_cohort_genotype_alias_and_field), snpdb/models/models_cohort.py (cohortgenotype_alias, get_zygosity_q)
