Conversation
Greptile Summary

This PR converts the Ray operator pipeline from synchronous to asynchronous execution.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/graph/abstract_operator.py | Replaced invasive DefaultEventLoopPolicy reset with a safe _ensure_event_loop() helper; call is now async (breaking change already flagged in prior thread) |
| nemo_retriever/src/nemo_retriever/graph/operator_archetype.py | Added aprocess/arun/async call delegation to resolved delegate; clean implementation |
| nemo_retriever/src/nemo_retriever/text_embed/operators.py | BatchEmbedActor is now a proper top-level class (not _BatchEmbedActor alias), fixing name for node lookups; backward-compat getattr shim retained |
| nemo_retriever/src/nemo_retriever/ocr/shared.py | New aocr_page_elements and async nemotron parse functions use print() for errors (should be logger.warning); use_table_structure silently dropped in remote async path (flagged in prior thread) |
| nemo_retriever/src/nemo_retriever/chart/shared.py | agraphic_elements_ocr_page_elements still uses print() for all three error paths (flagged in prior thread) |
| nemo_retriever/tests/testing_utils.py | New _run() helper cleanly handles both coroutines and plain values using asyncio.new_event_loop(); fixes deprecated asyncio.get_event_loop() pattern across all test files |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | build_graph / build_inprocess_graph alias unchanged; BatchEmbedActor reference now correct after rename |
| nemo_retriever/src/nemo_retriever/operators/embedding.py | Thin re-export module; correctly delegates to text_embed.operators.BatchEmbedActor |
Class Diagram
```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class AbstractOperator {
        +preprocess(data) Any
        +process(data) Any
        +postprocess(data) Any
        +run(data) Any
        +aprocess(data) coroutine
        +arun(data) coroutine
        +__call__(data) coroutine
        +get_constructor_kwargs() dict
    }
    class ArchetypeOperator {
        +_cpu_variant_class
        +_gpu_variant_class
        +resolve_operator_class() type
        +aprocess(data) coroutine
        +arun(data) coroutine
        +__call__(data) coroutine
        -_resolve_delegate() AbstractOperator
    }
    class BatchEmbedActor {
        +prefers_cpu_variant() bool
        +cpu_variant_class() type
        +gpu_variant_class() type
    }
    class BatchEmbedCPUActor
    class BatchEmbedGPUActor
    AbstractOperator <|-- ArchetypeOperator
    ArchetypeOperator <|-- BatchEmbedActor
    AbstractOperator <|-- BatchEmbedCPUActor
    AbstractOperator <|-- BatchEmbedGPUActor
    BatchEmbedActor ..> BatchEmbedCPUActor : resolves to
    BatchEmbedActor ..> BatchEmbedGPUActor : resolves to
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ocr/shared.py
Line: 1222-1223
Comment:
**`print()` in new async OCR and parse functions — violates no-print-statements rule**
The new `aocr_page_elements` function (and the async version of `anemotron_parse_page_elements` at line ~1411) both use `print()` for error reporting. All surrounding sync functions and other actors use `logger.warning()`. Since `aocr_page_elements` is called from Ray actor workers, these messages will bypass structured log handlers and cannot be filtered, captured, or routed to observability tooling.
```suggestion
except BaseException as e:
logger.warning("OCR failed: %s: %s", type(e).__name__, e)
```
**Rule Used:** Use the Python logging module (import logging; log... ([source](https://app.greptile.com/review/custom-context?memory=no-print-statements))
How can I resolve this? If you propose a fix, please make it concise.

Reviews (5). Last reviewed commit: "Address review comments".
> ``asyncio.to_thread`` would allow Ray Data to run multiple batches in parallel threads inside the same actor, causing memory corruption.
Is this true? Claude flagged that we're using `asyncio.to_thread` in our local GPU page-elements implementation, which suggests the memory-corruption claim is dubious.
Do we know for sure asyncio.to_thread is not causing those types of memory issues in the GPU page-elements implementation?
Should we also apply these changes to …?
```python
# Internal cache for local HF embedders, keyed by model name.
_embedder_cache: dict = field(default_factory=dict, init=False, repr=False, compare=False)

def __str__(self) -> str:
```
where and how is this used?
It's just the overloaded Python `__str__` method, giving the string representation of the object itself. I use it in tooling external to this codebase.
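For readers unfamiliar with the hook: Python calls `__str__` whenever the object is stringified, e.g. via `str(obj)`, `print(obj)`, or an f-string. A minimal illustration with assumed class and field names (not the actual class under review):

```python
from dataclasses import dataclass, field


@dataclass
class EmbedderConfig:
    # Illustrative stand-in for the reviewed class; field names are assumed.
    model_name: str = "example-model"
    _embedder_cache: dict = field(default_factory=dict, init=False, repr=False, compare=False)

    def __str__(self) -> str:
        # Deliberately omits the cache, unlike the auto-generated repr.
        return f"EmbedderConfig(model_name={self.model_name!r})"
```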
```python
blocks = _parse_ocr_result(preds)
if label_name == "table":
    crop_hw_table: Tuple[int, int] = (0, 0)
    try:
        _raw = base64.b64decode(crop_b64s[i])
        with Image.open(io.BytesIO(_raw)) as _cim:
            _cw, _ch = _cim.size
            crop_hw_table = (_ch, _cw)
    except Exception:
        pass
    text = _blocks_to_pseudo_markdown(blocks, crop_hw=crop_hw_table) or _blocks_to_text(blocks)
```
**`use_table_structure` silently dropped in async remote OCR path**

`aocr_page_elements` has no `use_table_structure` parameter, and its remote-path table handling (lines 1199–1209) goes directly to `_blocks_to_pseudo_markdown`, skipping the `_find_ts_detections_for_bbox` / `join_table_structure_and_ocr_output` logic that the sync counterpart executes when `use_table_structure=True`. `OCRCPUActor` stores `use_table_structure` in `self.ocr_kwargs` and passes it via `**kwargs` to `aocr_page_elements`, but the async remote branch never reads it. Result: table-structure enrichment is silently absent for all remote-mode async invocations, even though the actor's config requests it.

The non-remote branch correctly delegates to `asyncio.to_thread(ocr_page_elements, ...)`, which preserves the logic, so behavior is inconsistent depending on whether `invoke_url` is set.
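A concise fix would thread the flag into the remote branch so it mirrors the sync path. The sketch below uses the helper names from this comment, but their bodies are stubs standing in for the real logic, and the function signature is an assumption:

```python
import asyncio


def _find_ts_detections_for_bbox(ts_detections, bbox):
    # Stub: the real helper matches table-structure detections to the crop.
    return [d for d in ts_detections if d.get("bbox") == bbox]


def join_table_structure_and_ocr_output(matched, blocks):
    # Stub: the real helper merges structure cells with OCR text.
    return {"mode": "table-structure", "cells": matched, "blocks": blocks}


def _blocks_to_pseudo_markdown(blocks):
    # Stub: the current remote-path fallback.
    return {"mode": "pseudo-markdown", "blocks": blocks}


async def aocr_table_text(blocks, bbox, ts_detections, *, use_table_structure=False):
    # Branch exactly as the sync counterpart does, instead of always
    # falling through to _blocks_to_pseudo_markdown.
    if use_table_structure:
        matched = _find_ts_detections_for_bbox(ts_detections, bbox)
        if matched:
            return join_table_structure_and_ocr_output(matched, blocks)
    return _blocks_to_pseudo_markdown(blocks)
```

With `use_table_structure` accepted explicitly, the value `OCRCPUActor` already forwards via `**kwargs` stops being silently ignored.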
Please review and merge #1860 first
Checklist