Draft: add graph fusion optimizer and qualify wider PDF stages #1846

Draft
jioffe502 wants to merge 4 commits into NVIDIA:main from jioffe502:codex/fused-v1-table-graphic-qualification
Conversation

@jioffe502
Collaborator

Summary

This draft introduces graph fusion as an internal compile-time optimizer for the graph pipeline. Fusion remains behind enable_fusion; public execution modes stay batch and inprocess.

The current draft:

  • adds FusedOperator and compile-time graph rewriting for eligible linear chains
  • adds the ProcessOnlyFusionSafe contract for process-level fusion legality
  • wires enable_fusion through the graph executor, graph ingestor, example CLI, and harness
  • removes stale public run_mode="fused" surface so the API matches the actual execution model
  • adds explicit ray_object_store_memory_bytes plumbing for large-run Ray comparisons
  • qualifies page-elements, table-structure, graphic-elements, and OCR actors for wider linear fusion when present
  • fixes harness forwarding for table/graphic extraction flags so widened-chain experiments actually exercise the intended graph shape
  • adds focused compiler/executor/harness tests and widened-chain structural tests
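
The compile-time rewrite described above can be sketched as a pass that collapses maximal runs of fusion-safe nodes into a single synthetic node. This is illustrative only: the real optimizer operates on the pipeline's graph IR, and the `Node`/`fuse_linear_chain` names here are assumptions, not the PR's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Node:
    """Minimal stand-in for a pipeline graph node."""
    name: str
    fn: Callable[[Any], Any]
    fusion_safe: bool = False  # mirrors the ProcessOnlyFusionSafe idea


def fuse_linear_chain(nodes: List[Node]) -> List[Node]:
    """Replace each maximal run of fusion-safe nodes with one synthetic node."""
    out: List[Node] = []
    run: List[Node] = []

    def flush() -> None:
        if len(run) > 1:
            fns = [n.fn for n in run]
            name = "Fused[" + "+".join(n.name for n in run) + "]"

            def fused(x, fns=fns):
                # Run the chained stages back-to-back in one process,
                # avoiding per-stage serialization between them.
                for f in fns:
                    x = f(x)
                return x

            out.append(Node(name, fused, fusion_safe=True))
        else:
            out.extend(run)
        run.clear()

    for node in nodes:
        if node.fusion_safe:
            run.append(node)
        else:
            flush()
            out.append(node)
    flush()
    return out


chain = [
    Node("PageElementDetectionActor", lambda x: x + ["pe"], fusion_safe=True),
    Node("OCRActor", lambda x: x + ["ocr"], fusion_safe=True),
    Node("Writer", lambda x: x),  # not fusion-safe: terminates the run
]
plan = fuse_linear_chain(chain)
print([n.name for n in plan])
# → ['Fused[PageElementDetectionActor+OCRActor]', 'Writer']
```

Note that the synthetic node's name matches the `Fused[A+B]` convention the PR describes, which makes the rewritten plan easy to spot in a fusion summary.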

Why This Shape

Fusion is implemented as an optimizer, not a new runtime backend.

  • batch and inprocess remain the execution backends
  • enable_fusion compiles the graph into a more efficient plan when legal
  • eligible chains are replaced with an internal synthetic node such as Fused[PageElementDetectionActor+OCRActor]
  • legality is explicit and capability-driven rather than inferred from stage names
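
Capability-driven legality can be expressed as a marker contract that actors opt into explicitly, so the compiler never has to guess from a stage's name. `ProcessOnlyFusionSafe` is the contract named in this PR, but the class shapes and the `fusion_legal` helper below are assumptions for illustration:

```python
class ProcessOnlyFusionSafe:
    """Marker contract: the actor's work stays in-process, so chaining it
    with a neighbor in the same process cannot change observable semantics."""


class PageElementDetectionActor(ProcessOnlyFusionSafe):
    """Opts in to fusion by carrying the capability."""


class ExternalSinkActor:
    """Hypothetical stage with cross-process side effects; never fused."""


def fusion_legal(*actors) -> bool:
    # A chain is eligible only if every actor carries the capability;
    # one unsafe stage makes the whole chain illegal.
    return all(isinstance(a, ProcessOnlyFusionSafe) for a in actors)


print(fusion_legal(PageElementDetectionActor(), PageElementDetectionActor()))  # → True
print(fusion_legal(PageElementDetectionActor(), ExternalSinkActor()))          # → False
```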

This follows the standard pattern used by dataflow systems and compilers:

  • Flink operator chaining
  • Beam fusion
  • TensorRT build-time optimization
  • XLA / OpenXLA fusion
  • TVM fusion passes

Validation

Focused test coverage:

  • tests/test_pipeline_graph.py
  • tests/test_harness_config.py
  • tests/test_harness_run.py

Validated results:

  • bo20, dgx_8gpu, embed_batch_size=32
      • median ingest improved from 52.45s to 37.79s with PE->OCR fusion
      • semantics matched on pages, rows, and detection summary
  • bo767, dgx_8gpu, embed_batch_size=32, ray_object_store_memory_bytes=1000000000000
      • baseline ingest: 371.90s
      • fused ingest: 319.88s
      • pages and rows matched
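
Back-of-envelope arithmetic on the reported medians (roughly a 28% ingest improvement on bo20 and 14% on bo767):

```python
def speedup_pct(baseline_s: float, fused_s: float) -> float:
    """Percentage reduction in ingest time relative to the baseline."""
    return 100.0 * (baseline_s - fused_s) / baseline_s


# Numbers taken from the validation runs above.
print(f"bo20:  {speedup_pct(52.45, 37.79):.1f}% faster ingest")
print(f"bo767: {speedup_pct(371.90, 319.88):.1f}% faster ingest")
```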

Still In Draft

This PR stays draft because:

  • the widened table/graphic-qualified chain has structural coverage but still needs its real baseline/fused harness validation pair
  • the long quiet post-ingest recall/writeout tail is a separate dev-experience issue and is not resolved here

Follow-Up

Next validation steps:

  • run bo20 baseline/fused with use_table_structure=true and use_graphic_elements=true
  • if stable, repeat on bo767
  • compare fusion_summary, ingest time, rows, pages, and detection/recall behavior
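
The planned baseline/fused pair for the widened chain could look like the sketch below. Only the flags named in this PR (use_table_structure, use_graphic_elements, enable_fusion, embed_batch_size) are used, but the dict structure itself is an assumption, not the harness's real config schema:

```python
# Hypothetical run-pair shape for the widened-chain validation.
baseline = dict(
    dataset="bo20",
    profile="dgx_8gpu",
    embed_batch_size=32,
    use_table_structure=True,
    use_graphic_elements=True,
    enable_fusion=False,
)
# The fused run differs from baseline only in the fusion switch,
# keeping the comparison apples-to-apples.
fused = {**baseline, "enable_fusion": True}

# Axes named in the follow-up plan for comparing the two runs.
compare_on = ["fusion_summary", "ingest time", "rows", "pages", "detection/recall behavior"]
print(fused["enable_fusion"], baseline["enable_fusion"])  # → True False
```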

@copy-pr-bot

copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jioffe502 force-pushed the codex/fused-v1-table-graphic-qualification branch from cef5041 to 860d156 on April 20, 2026 at 16:18.
