Draft: add graph fusion optimizer and qualify wider PDF stages #1846

Draft
jioffe502 wants to merge 4 commits into NVIDIA:main from jioffe502:codex/fused-v1-table-graphic-qualification
Conversation

@jioffe502
Collaborator

Summary

This draft introduces graph fusion as an internal compile-time optimizer for the graph pipeline. Fusion remains behind enable_fusion; public execution modes stay batch and inprocess.

The current draft:

  • adds FusedOperator and compile-time graph rewriting for eligible linear chains
  • adds the ProcessOnlyFusionSafe contract for process-level fusion legality
  • wires enable_fusion through the graph executor, graph ingestor, example CLI, and harness
  • removes stale public run_mode="fused" surface so the API matches the actual execution model
  • adds explicit ray_object_store_memory_bytes plumbing for large-run Ray comparisons
  • qualifies page-elements, table-structure, graphic-elements, and OCR actors for wider linear fusion when present
  • fixes harness forwarding for table/graphic extraction flags so widened-chain experiments actually exercise the intended graph shape
  • adds focused compiler/executor/harness tests and widened-chain structural tests
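
The compile-time rewrite described above can be sketched as a pass that collapses maximal runs of fusion-safe nodes into a single synthetic node. This is illustrative only: the real optimizer operates on the pipeline's graph IR, and the `Node`/`fuse_linear_chain` names here are assumptions, not the PR's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Node:
    """Minimal stand-in for a pipeline graph node."""
    name: str
    fn: Callable[[Any], Any]
    fusion_safe: bool = False  # mirrors the ProcessOnlyFusionSafe idea


def fuse_linear_chain(nodes: List[Node]) -> List[Node]:
    """Replace each maximal run of fusion-safe nodes with one synthetic node."""
    out: List[Node] = []
    run: List[Node] = []

    def flush() -> None:
        if len(run) > 1:
            fns = [n.fn for n in run]
            name = "Fused[" + "+".join(n.name for n in run) + "]"

            def fused(x, fns=fns):
                # Run the chained stages back-to-back in one process,
                # avoiding per-stage serialization between them.
                for f in fns:
                    x = f(x)
                return x

            out.append(Node(name, fused, fusion_safe=True))
        else:
            out.extend(run)
        run.clear()

    for node in nodes:
        if node.fusion_safe:
            run.append(node)
        else:
            flush()
            out.append(node)
    flush()
    return out


chain = [
    Node("PageElementDetectionActor", lambda x: x + ["pe"], fusion_safe=True),
    Node("OCRActor", lambda x: x + ["ocr"], fusion_safe=True),
    Node("Writer", lambda x: x),  # not fusion-safe: terminates the run
]
plan = fuse_linear_chain(chain)
print([n.name for n in plan])
# → ['Fused[PageElementDetectionActor+OCRActor]', 'Writer']
```

Note that the synthetic node's name matches the `Fused[A+B]` convention the PR describes, which makes the rewritten plan easy to spot in a fusion summary.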

Why This Shape

Fusion is implemented as an optimizer, not a new runtime backend.

  • batch and inprocess remain the execution backends
  • enable_fusion compiles the graph into a more efficient plan when legal
  • eligible chains are replaced with an internal synthetic node such as Fused[PageElementDetectionActor+OCRActor]
  • legality is explicit and capability-driven rather than inferred from stage names
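
Capability-driven legality can be expressed as a marker contract that actors opt into explicitly, so the compiler never has to guess from a stage's name. `ProcessOnlyFusionSafe` is the contract named in this PR, but the class shapes and the `fusion_legal` helper below are assumptions for illustration:

```python
class ProcessOnlyFusionSafe:
    """Marker contract: the actor's work stays in-process, so chaining it
    with a neighbor in the same process cannot change observable semantics."""


class PageElementDetectionActor(ProcessOnlyFusionSafe):
    """Opts in to fusion by carrying the capability."""


class ExternalSinkActor:
    """Hypothetical stage with cross-process side effects; never fused."""


def fusion_legal(*actors) -> bool:
    # A chain is eligible only if every actor carries the capability;
    # one unsafe stage makes the whole chain illegal.
    return all(isinstance(a, ProcessOnlyFusionSafe) for a in actors)


print(fusion_legal(PageElementDetectionActor(), PageElementDetectionActor()))  # → True
print(fusion_legal(PageElementDetectionActor(), ExternalSinkActor()))          # → False
```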

This follows the standard pattern used by dataflow systems and compilers:

  • Flink operator chaining
  • Beam fusion
  • TensorRT build-time optimization
  • XLA / OpenXLA fusion
  • TVM fusion passes

Validation

Focused test coverage:

  • tests/test_pipeline_graph.py
  • tests/test_harness_config.py
  • tests/test_harness_run.py

Validated results:

  • bo20, dgx_8gpu, embed_batch_size=32
      • median ingest improved from 52.45s to 37.79s with PE->OCR fusion
      • semantics matched on pages, rows, and detection summary
  • bo767, dgx_8gpu, embed_batch_size=32, ray_object_store_memory_bytes=1000000000000
      • baseline ingest: 371.90s
      • fused ingest: 319.88s
      • pages and rows matched
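
Back-of-envelope arithmetic on the reported medians (roughly a 28% ingest improvement on bo20 and 14% on bo767):

```python
def speedup_pct(baseline_s: float, fused_s: float) -> float:
    """Percentage reduction in ingest time relative to the baseline."""
    return 100.0 * (baseline_s - fused_s) / baseline_s


# Numbers taken from the validation runs above.
print(f"bo20:  {speedup_pct(52.45, 37.79):.1f}% faster ingest")
print(f"bo767: {speedup_pct(371.90, 319.88):.1f}% faster ingest")
```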

Still In Draft

This PR stays draft because:

  • the widened table/graphic-qualified chain has structural coverage but still needs its real baseline/fused harness validation pair
  • the long quiet post-ingest recall/writeout tail is a separate dev-experience issue and is not resolved here

Follow-Up

Next validation steps:

  • run bo20 baseline/fused with use_table_structure=true and use_graphic_elements=true
  • if stable, repeat on bo767
  • compare fusion_summary, ingest time, rows, pages, and detection/recall behavior
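
The planned baseline/fused pair for the widened chain could look like the sketch below. Only the flags named in this PR (use_table_structure, use_graphic_elements, enable_fusion, embed_batch_size) are used, but the dict structure itself is an assumption, not the harness's real config schema:

```python
# Hypothetical run-pair shape for the widened-chain validation.
baseline = dict(
    dataset="bo20",
    profile="dgx_8gpu",
    embed_batch_size=32,
    use_table_structure=True,
    use_graphic_elements=True,
    enable_fusion=False,
)
# The fused run differs from baseline only in the fusion switch,
# keeping the comparison apples-to-apples.
fused = {**baseline, "enable_fusion": True}

# Axes named in the follow-up plan for comparing the two runs.
compare_on = ["fusion_summary", "ingest time", "rows", "pages", "detection/recall behavior"]
print(fused["enable_fusion"], baseline["enable_fusion"])  # → True False
```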

@copy-pr-bot

copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jioffe502 force-pushed the codex/fused-v1-table-graphic-qualification branch from cef5041 to 860d156 on April 20, 2026 at 16:18.
