-
Notifications
You must be signed in to change notification settings - Fork 218
Description
Summary
AUTO mode no longer hits 20s timeouts on the current benchmark set, but dense multi-clause non-adj predicates can still push pair growth toward quadratic. We should track a DF-native mitigation before declaring worst-case safe.
Repro
Branch/commit: feat/where-clause-executor @ 9127f9e0 (pandas/CPU).
Example pattern (matches nonadj_multi in benchmarks/run_chain_vs_samepath.py):
g.chain([
n(name="a"), e_forward(name="e1"), n(name="b"),
e_reverse(name="e2"), n(name="c")
]).where([
compare(col("a", "v_mod10"), "==", col("c", "v_mod10")),
compare(col("a", "v"), "<", col("c", "v")),
])Run:
GRAPHISTRY_NON_ADJ_WHERE_MODE=auto \
GRAPHISTRY_NON_ADJ_WHERE_DOMAIN_SEMIJOIN_AUTO=1 \
GRAPHISTRY_EDGE_WHERE_SEMIJOIN_AUTO=1 \
PYTHONPATH=. python benchmarks/run_chain_vs_samepath.py \
--runs 3 --warmup 1 --engine pandas \
--graph-filter large_dense \
--scenario-filter nonadj_multi --seed 42Observed (approx): large_dense (100k nodes / 500k edges) is ~0.9–1.0s and still ~1.6–1.8x slower than regular; 3-hop multi-eq ~1.1–1.2s. This is acceptable today but still asymptotically risky as density/hops grow or predicates get less selective.
Why this matters
Even with semijoin gating + dedup reductions + cached edge-pairs, multi-clause non-adj WHERE still materializes large join intermediates. On denser graphs or additional hops, we can still see superlinear growth.
DF-native ideas to investigate
- Adjacency-aware semijoin that remains DataFrame-only (e.g., pre-group/merge vs full scan)
- Mid-node intersection prefiltering only when it shrinks (needs robust gating)
- Stats/NDV-driven clause ordering or gating (avoid expensive joins early)
- Factorized/grouped representations to avoid cross-product materialization
- Additional dense multi-clause benchmarks to catch regressions
Current mitigations already in place
- AUTO semijoin gating based on pair estimates
- Reduced dedup overhead in semijoin path
- Cached edge-pairs per edge when allowed_edges unset
This issue tracks the remaining asymptotic risk for AUTO WHERE.