Skip to content

AUTO WHERE: dense multi-clause non-adj still risks asymptotic blowups #904

@lmeyerov

Description

@lmeyerov

Summary

AUTO mode no longer hits 20s timeouts on the current benchmark set, but dense multi-clause non-adj predicates can still push pair growth toward quadratic. We should track a DF-native mitigation before declaring worst-case safe.

Repro

Branch/commit: feat/where-clause-executor @ 9127f9e0 (pandas/CPU).

Example pattern (matches nonadj_multi in benchmarks/run_chain_vs_samepath.py):

g.chain([
  n(name="a"), e_forward(name="e1"), n(name="b"),
  e_reverse(name="e2"), n(name="c")
]).where([
  compare(col("a", "v_mod10"), "==", col("c", "v_mod10")),
  compare(col("a", "v"), "<", col("c", "v")),
])

Run:

GRAPHISTRY_NON_ADJ_WHERE_MODE=auto \
GRAPHISTRY_NON_ADJ_WHERE_DOMAIN_SEMIJOIN_AUTO=1 \
GRAPHISTRY_EDGE_WHERE_SEMIJOIN_AUTO=1 \
PYTHONPATH=. python benchmarks/run_chain_vs_samepath.py \
  --runs 3 --warmup 1 --engine pandas \
  --graph-filter large_dense \
  --scenario-filter nonadj_multi --seed 42

Observed (approx): large_dense (100k nodes / 500k edges) is ~0.9–1.0s and still ~1.6–1.8x slower than regular; 3-hop multi-eq ~1.1–1.2s. This is acceptable today but still asymptotically risky as density/hops grow or predicates get less selective.

Why this matters

Even with semijoin gating + dedup reductions + cached edge-pairs, multi-clause non-adj WHERE still materializes large join intermediates. On denser graphs or additional hops, we can still see superlinear growth.

DF-native ideas to investigate

  • Adjacency-aware semijoin that remains DataFrame-only (e.g., pre-group/merge vs full scan)
  • Mid-node intersection prefiltering only when it shrinks (needs robust gating)
  • Stats/NDV-driven clause ordering or gating (avoid expensive joins early)
  • Factorized/grouped representations to avoid cross-product materialization
  • Additional dense multi-clause benchmarks to catch regressions

Current mitigations already in place

  • AUTO semijoin gating based on pair estimates
  • Reduced dedup overhead in semijoin path
  • Cached edge-pairs per edge when allowed_edges unset

This issue tracks the remaining asymptotic risk for AUTO WHERE.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions