Analysis MergeNode optimisations

🤖 Written by Claude

## Overview

When multiple `SampleNode`/`CohortNode` arms feed into a `MergeNode` and those arms are over the **same cohort** (same `CohortGenotypeCollection`), the merge currently runs each arm as a **separate full pass** over the cohortgenotype partition. It then materialises each arm's PKs into a Python list and combines them as `Variant WHERE id IN (listA) OR id IN (listB)`.

The arms already share one join alias (`cohortgenotype_<cgc_pk>`), so in principle the same result could be produced in a **single pass** — exactly what a `CohortNode` does (one join, one OR/regex predicate over `samples_zygosity`). The duplicate passes are the optimisation target.
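
The difference between the two strategies can be illustrated with a toy model. This is only a sketch in plain Python, not the real query path: `Partition`, `sample_has_variant` and the zygosity codes (`E`, `O`, `R`) are stand-ins invented for the example, and a "pass" here is just one loop over an in-memory dict.

```python
from typing import Callable

# Toy stand-in for one cohortgenotype partition: variant pk -> samples_zygosity,
# one character per sample. The codes are illustrative ("E" het, "O" hom alt, "R" ref).
ROWS = {101: "ER", 102: "RO", 103: "RR", 104: "EE"}

Predicate = Callable[[str], bool]


class Partition:
    """Counts how many full scans are made over the rows."""

    def __init__(self, rows: dict[int, str]) -> None:
        self.rows = rows
        self.passes = 0

    def scan(self, predicate: Predicate) -> list[int]:
        self.passes += 1
        return [pk for pk, zygosity in self.rows.items() if predicate(zygosity)]


def sample_has_variant(index: int) -> Predicate:
    """Arm filter: the sample at `index` is het or hom alt."""
    return lambda zygosity: zygosity[index] in "EO"


def merge_by_materialising(partition: Partition, arms: list[Predicate]) -> list[int]:
    """Current behaviour: one scan per arm, then union the PK lists."""
    pk_lists = [partition.scan(arm) for arm in arms]
    return sorted(set().union(*pk_lists))


def merge_in_single_pass(partition: Partition, arms: list[Predicate]) -> list[int]:
    """Target behaviour: one scan with the arm predicates OR-ed together."""
    return sorted(partition.scan(lambda z: any(arm(z) for arm in arms)))


arms = [sample_has_variant(0), sample_has_variant(1)]
two_pass, one_pass = Partition(ROWS), Partition(ROWS)
assert merge_by_materialising(two_pass, arms) == merge_in_single_pass(one_pass, arms)
print(two_pass.passes, one_pass.passes)
```

Both functions return the same PKs; only the number of scans differs, and it grows with the number of arms in the materialising version.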

## Why it happens

`arg_q_dict` expresses *AND-of-keyed-filters*, applied one annotation alias at a time (`AnalysisNode.get_queryset`). A merge needs **OR across arms**, and:

- Each sample's zygosity filter is keyed under a *per-sample* alias (`sample_<pk>`, a `Substr` over the shared join), so distinct arms never share a key.
- The format has no representation for "OR across different annotation-dependent keys".

So `MergeNode._split_common_filters` can only OR-combine `None`-keyed (annotation-independent) filters; any arm with a non-`None` key falls back to `parent.get_queryset()` → materialise PKs → `Q(pk__in=...)`.
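
A rough model of that limitation, assuming `arg_q_dict` is shaped like "alias (or `None`) mapped to filters". Strings stand in for Django `Q` objects, and `split_arms` / `shared_aliases` are hypothetical helpers written for this sketch, not the real `_split_common_filters`.

```python
from typing import Optional

# Toy model of arg_q_dict: annotation alias (or None) -> filter descriptions.
# Strings stand in for Django Q objects; the exact real structure is an assumption.
ArgQDict = dict[Optional[str], list[str]]

ARMS: dict[str, ArgQDict] = {
    "sample_11_arm": {"sample_11": ["zygosity in (E, O)"]},
    "sample_12_arm": {"sample_12": ["zygosity in (E, O)"]},
    "region_arm": {None: ["locus in chr1:1000-2000"]},
}


def is_annotation_independent(arg_q_dict: ArgQDict) -> bool:
    """True when every filter sits under the None key."""
    return all(alias is None for alias in arg_q_dict)


def split_arms(arms: dict[str, ArgQDict]) -> tuple[list[str], list[str]]:
    """Separate arms that can be OR-combined from arms that must be materialised."""
    or_combinable, needs_materialising = [], []
    for name, arg_q_dict in arms.items():
        if is_annotation_independent(arg_q_dict):
            or_combinable.append(name)
        else:
            needs_materialising.append(name)
    return or_combinable, needs_materialising


def shared_aliases(a: ArgQDict, b: ArgQDict) -> set[str]:
    """Non-None aliases that two arms have in common."""
    return {alias for alias in a if alias is not None} & set(b)


or_combinable, needs_materialising = split_arms(ARMS)
print(or_combinable, needs_materialising)
```

In this model the two sample arms have no alias in common even though they read the same join, so both end up on the materialise path.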

## Notes / scope

- **Applies to ancestors, not just direct parents.** `arg_q_dict` and its stable aliases propagate up through intermediate filter nodes unchanged, so the same-cohort arms are detectable however deep the `SampleNode` sits below the merge. This is why a fix keyed on *shared annotation join* generalises, whereas a fix keyed on *"are my parents SampleNodes"* would miss the (common) case of filtering on each arm before merging.
- **Only large arms matter.** Arms resolving to ≤ `ANALYSIS_NODE_STORE_ID_SIZE_MAX` (default 1000) are already collapsed to literal PK lists upstream (`get_small_parent_arg_q_dict`), so they neither benefit nor need the rewrite. The saving lands on large arms — which is also where the duplicate passes hurt most.
- **Gate on shared join, not blanket OR.** Combining into one pass is only a win when arms share the same cohortgenotype join. Cross-cohort arms would multiply LEFT JOINs under the OR and likely regress — they should stay on the current materialise-then-OR path (the existing worst-case-protection behaviour).
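
Taken together, the three points suggest a gating decision along these lines. This is a sketch under assumptions: `Arm`, `MergePlan` and `plan_merge` are hypothetical names, `row_count` is an assumed estimate of how many variants an arm resolves to, and treating an arm that depends on more than one join as "materialise" is a conservative choice made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field

ANALYSIS_NODE_STORE_ID_SIZE_MAX = 1000  # default quoted in the notes above


@dataclass(frozen=True)
class Arm:
    name: str
    join_aliases: frozenset[str]  # annotation joins the arm's filters depend on
    row_count: int  # hypothetical estimate of variants the arm resolves to


@dataclass
class MergePlan:
    pk_list: list[str] = field(default_factory=list)  # already collapsed upstream
    single_pass: dict[str, list[str]] = field(default_factory=dict)  # join -> arms
    materialise: list[str] = field(default_factory=list)  # current fallback path


def plan_merge(arms: list[Arm]) -> MergePlan:
    plan = MergePlan()
    large_by_join: dict[str, list[str]] = defaultdict(list)
    for arm in arms:
        if arm.row_count <= ANALYSIS_NODE_STORE_ID_SIZE_MAX:
            plan.pk_list.append(arm.name)  # small arm: nothing to gain
        elif len(arm.join_aliases) == 1:
            large_by_join[next(iter(arm.join_aliases))].append(arm.name)
        else:
            plan.materialise.append(arm.name)  # assumption: stay conservative
    for join_alias, names in large_by_join.items():
        if len(names) > 1:
            plan.single_pass[join_alias] = names  # shared join: one pass for the group
        else:
            plan.materialise.extend(names)  # lone arm on its join: nothing to combine
    return plan


cohort_7 = frozenset({"cohortgenotype_7"})
arms = [
    Arm("proband", cohort_7, 50_000),
    Arm("mother", cohort_7, 80_000),
    Arm("other_cohort", frozenset({"cohortgenotype_9"}), 60_000),
    Arm("tiny", cohort_7, 200),
]
print(plan_merge(arms))
```

Only `proband` and `mother` are grouped for a single pass; the cross-cohort arm and the small arm keep today's behaviour.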

## Possible approaches

1. **Special-case same-cohort merge**: detect arms over one `CohortGenotypeCollection` and rebuild a cohort-style single-pass query. Simple-ish, but it adds a parallel query path that must re-handle quality/AF/gene-list/VCF/sub-cohort filters, and it breaks on intermediate filter nodes if keyed on node class.
2. **General filter rewriting (preferred)**: extend the merge/`get_queryset` protocol so an OR-Q can reference multiple *already-annotated* aliases that share a join: annotate them all up front, then apply a single OR `.filter()`. Naturally handles ancestors and degrades safely to today's behaviour for distinct joins.
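
The "annotate everything up front, then one OR filter" idea in approach 2 can be mimicked in plain Python. This is a sketch, not ORM code: `JOIN_FIELD`, `substr_annotation` and `annotate_then_or_filter` are hypothetical names, rows are dicts, and each per-sample alias is modelled as a 1-based, length-1 substring of the shared `samples_zygosity` value.

```python
from typing import Any, Callable

Row = dict[str, Any]
Annotation = Callable[[Row], str]

JOIN_FIELD = "cohortgenotype_7__samples_zygosity"  # hypothetical shared-join column


def substr_annotation(position: int) -> Annotation:
    """Mimic a per-sample Substr annotation (1-based position, length 1)."""
    return lambda row: row[JOIN_FIELD][position - 1]


def annotate_then_or_filter(
    rows: list[Row],
    annotations: dict[str, Annotation],
    accepted: dict[str, set[str]],
) -> list[int]:
    """Single pass: add every alias to the row, then keep it if any arm matches."""
    kept = []
    for row in rows:
        # Step 1: annotate all aliases up front; they all read the same joined column.
        annotated = {**row, **{alias: fn(row) for alias, fn in annotations.items()}}
        # Step 2: one OR across the annotated aliases instead of one filter per arm.
        if any(annotated[alias] in codes for alias, codes in accepted.items()):
            kept.append(annotated["pk"])
    return kept


ROWS = [
    {"pk": 101, JOIN_FIELD: "ER"},
    {"pk": 102, JOIN_FIELD: "RO"},
    {"pk": 103, JOIN_FIELD: "RR"},
    {"pk": 104, JOIN_FIELD: "RE"},
]
annotations = {"sample_11": substr_annotation(1), "sample_12": substr_annotation(2)}
accepted = {"sample_11": {"E", "O"}, "sample_12": {"O"}}  # each arm keeps its own filter
print(annotate_then_or_filter(ROWS, annotations, accepted))
```

Each arm keeps its own predicate (here `sample_12` accepts only hom alt), which is the part a cohort-style special case would have to re-implement for every filter type.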

## Relevant code

- `analysis/models/nodes/filters/merge_node.py` — `_split_common_filters`, `_get_merged_q_dict`
- `analysis/models/nodes/analysis_node.py` — `get_queryset` (annotate-then-filter loop), `get_arg_q_dict`, `get_small_parent_arg_q_dict`
- `analysis/models/nodes/cohort_mixin.py`, `snpdb/models/models_vcf.py` (`Sample.get_cohort_genotype_alias_and_field`), `snpdb/models/models_cohort.py` (`cohortgenotype_alias`, `get_zygosity_q`)

