🤖 Written by Claude
Overview
When multiple SampleNode/CohortNode arms feed into a MergeNode, and those arms are over the same cohort (same CohortGenotypeCollection), the merge currently runs each arm as a separate full pass over the cohortgenotype partition, materialises each arm's PKs into a Python list, and combines them as Variant WHERE id IN (listA) OR id IN (listB).
The arms already share one join alias (cohortgenotype_<cgc_pk>), so in principle the same result could be produced in a single pass — exactly what a CohortNode does (one join, one OR/regex predicate over samples_zygosity). The duplicate passes are the optimisation target.
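A toy model of the two strategies may make the target concrete. The sketch below is plain Python, not the project's Django code; the row layout, the single-character zygosity codes and the helper names (`scan`, `sample_arm`, `merge_current`, `merge_single_pass`) are all assumptions made for illustration.

```python
# Toy cohortgenotype partition: variant PK -> packed samples_zygosity string,
# one character per sample in the cohort (codes are illustrative only).
COHORT_GENOTYPE = {
    1: "ER",
    2: "RO",
    3: "RR",
    4: "EO",
}

PASSES = {"count": 0}  # number of full scans of the partition


def scan(predicate):
    """One full pass over the partition, returning matching variant PKs."""
    PASSES["count"] += 1
    return [pk for pk, zyg in COHORT_GENOTYPE.items() if predicate(zyg)]


def sample_arm(index, allowed):
    """Predicate for one arm: sample `index` has a zygosity code in `allowed`."""
    return lambda zyg: zyg[index] in allowed


def merge_current(arms):
    """Today's shape: one pass per arm, PK lists materialised, then OR-ed."""
    pk_lists = [scan(arm) for arm in arms]
    return sorted(set().union(*pk_lists))


def merge_single_pass(arms):
    """Target shape: one pass with a single OR predicate over all arms."""
    return sorted(scan(lambda zyg: any(arm(zyg) for arm in arms)))
```

Both functions return the same PKs for the same arms; the only difference in this model is how many times the partition is scanned (once per arm versus once in total).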
Why it happens
arg_q_dict expresses AND-of-keyed-filters, applied one annotation alias at a time (AnalysisNode.get_queryset). A merge needs OR across arms, and:
- Each sample's zygosity filter is keyed under a per-sample alias (sample_<pk>, a Substr over the shared join), so distinct arms never share a key.
- The format has no representation for "OR across different annotation-dependent keys".
So MergeNode._split_common_filters can only OR-combine None-keyed (annotation-independent) filters; any arm with a non-None key falls back to parent.get_queryset() → materialise PKs → Q(pk__in=...).
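The fallback described above can be illustrated with a simplified model. This is not the actual `_split_common_filters` code: the dict layout (alias key mapped to a list of filter labels), the filter strings and the function name `split_arms` are assumptions chosen only to mirror the description.

```python
def split_arms(parent_arg_q_dicts):
    """Split parent arms into those that can be OR-combined directly and
    those that must fall back to materialising PKs.

    Each arm is modelled as {alias_or_None: [filter labels]}; the entries
    within one arm are AND-ed together.
    """
    combinable, fallback = [], []
    for arm in parent_arg_q_dicts:
        if all(key is None for key in arm):
            # Annotation-independent only: safe to OR with other such arms.
            combinable.append(arm.get(None, []))
        else:
            # Depends on at least one annotation alias: no way to express
            # "OR across different keys", so this arm is run separately.
            fallback.append(arm)
    return combinable, fallback


arms = [
    {None: ["contig == 'chr1'"]},
    {"sample_11": ["zygosity in ('E', 'O')"]},
    {"sample_12": ["zygosity == 'O'"], None: ["quality >= 20"]},
]
combinable, fallback = split_arms(arms)
```

In this model the two sample arms end up in `fallback` even though both aliases would be computed over the same join, because their keys (`sample_11`, `sample_12`) differ.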
Notes / scope
- Applies to ancestors, not just direct parents. arg_q_dict and its stable aliases propagate up through intermediate filter nodes unchanged, so the same-cohort arms are detectable however deep the SampleNode sits below the merge. This is why a fix keyed on shared annotation join generalises, whereas a fix keyed on "are my parents SampleNodes" would miss the (common) case of filtering on each arm before merging.
- Only large arms matter. Arms resolving to ≤ ANALYSIS_NODE_STORE_ID_SIZE_MAX (default 1000) are already collapsed to literal PK lists upstream (get_small_parent_arg_q_dict), so they neither benefit nor need the rewrite. The saving lands on large arms, which is also where the duplicate passes hurt most.
- Gate on shared join, not blanket OR. Combining into one pass is only a win when arms share the same cohortgenotype join. Cross-cohort arms would multiply LEFT JOINs under the OR and likely regress — they should stay on the current materialise-then-OR path (the existing worst-case-protection behaviour).
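The scoping rules above amount to a small planning step. The following is a minimal sketch of that gate, assuming each arm can be summarised as a dict with a name, a join alias and a result count; the dict shape and the function name `plan_merge` are hypothetical, and only the threshold value comes from the text.

```python
from collections import defaultdict

ANALYSIS_NODE_STORE_ID_SIZE_MAX = 1000  # default quoted above


def plan_merge(arms):
    """Decide, per arm, between a shared single pass and the existing path.

    Each arm is a hypothetical dict: {"name", "join_alias", "count"}.
    Returns (single_pass_groups, existing_path_arm_names).
    """
    by_join = defaultdict(list)
    existing_path = []
    for arm in arms:
        if arm["count"] <= ANALYSIS_NODE_STORE_ID_SIZE_MAX:
            # Small arm: already collapsed to a literal PK list upstream.
            existing_path.append(arm["name"])
        else:
            by_join[arm["join_alias"]].append(arm["name"])

    single_pass = {}
    for alias, names in by_join.items():
        if len(names) > 1:
            # Large arms over the same join: candidates for one OR pass.
            single_pass[alias] = names
        else:
            # Lone arm on its join: nothing to share, keep today's path.
            existing_path.extend(names)
    return single_pass, existing_path
```

Grouping by join alias rather than by node class is what lets the same gate apply regardless of how many filter nodes sit between the sample arm and the merge.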
Possible approaches
- Special-case same-cohort merge: detect arms over one CohortGenotypeCollection and rebuild a cohort-style single-pass query. Simple-ish, but it adds a parallel query path that must re-handle quality/AF/gene-list/VCF/sub-cohort filters, and it breaks on intermediate filter nodes if keyed on node class.
- General filter rewriting (preferred): extend the merge/get_queryset protocol so an OR-Q can reference multiple already-annotated aliases that share a join. All aliases are annotated up front, then a single OR .filter() is applied. This naturally handles ancestors and degrades safely to today's behaviour for distinct joins.
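The preferred approach can be sketched as "annotate every alias first, then filter once". The model below uses plain Python lists and dicts as stand-ins for a queryset; `annotate`, `filter_or`, the alias names and the zygosity codes are assumptions for illustration and do not reflect the real get_queryset signature.

```python
def annotate(rows, annotations):
    """Add every alias to every row in one step (stand-in for .annotate())."""
    return [
        {**row, **{alias: fn(row) for alias, fn in annotations.items()}}
        for row in rows
    ]


def filter_or(rows, conditions):
    """Keep rows matching ANY (alias, allowed) pair (stand-in for one OR .filter())."""
    return [
        row for row in rows
        if any(row[alias] in allowed for alias, allowed in conditions)
    ]


# Rows as they might look after the single shared cohortgenotype join.
rows = [
    {"pk": 1, "samples_zygosity": "ER"},
    {"pk": 2, "samples_zygosity": "RO"},
    {"pk": 3, "samples_zygosity": "RR"},
    {"pk": 4, "samples_zygosity": "EO"},
]

# Per-sample aliases: each is a one-character substring of the packed field.
annotations = {
    "sample_11": lambda row: row["samples_zygosity"][0],
    "sample_12": lambda row: row["samples_zygosity"][1],
}

# One OR across two different aliases, evaluated in a single pass.
conditions = [("sample_11", {"E"}), ("sample_12", {"O"})]
merged_pks = [row["pk"] for row in filter_or(annotate(rows, annotations), conditions)]
```

The point of the ordering is that the OR condition can only refer to `sample_11` and `sample_12` once both exist on the row, which is what applying filters one alias at a time cannot provide.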
Relevant code
analysis/models/nodes/filters/merge_node.py — _split_common_filters, _get_merged_q_dict
analysis/models/nodes/analysis_node.py — get_queryset (annotate-then-filter loop), get_arg_q_dict, get_small_parent_arg_q_dict
analysis/models/nodes/cohort_mixin.py, snpdb/models/models_vcf.py (Sample.get_cohort_genotype_alias_and_field), snpdb/models/models_cohort.py (cohortgenotype_alias, get_zygosity_q)