harness: stabilize embed OOM tuning and defaults by jioffe502 · Pull Request #1823 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-04-08T18:41:53Z

TLDR

This PR hardens embedding stability by separating two different control knobs:

embed_batch_size for Ray Data transport/scheduling, and
embed_inference_batch_size for local model forward-pass VRAM pressure.

It keeps the change focused on core runtime behavior: decoupled tuning controls plus adaptive OOM retry.

Problem

During bo767 validation, we observed long-sequence tail batches driving CUDA OOM-like failures and throughput volatility. The key discovery is that one batch knob is doing two jobs:

scheduler/transport sizing (embed_batch_size), and
model microbatch sizing (inference_batch_size).

That coupling makes stability tuning hard, especially when sequence lengths are heterogeneous.

What Changed

1) Decoupled controls in harness/CLI

Added embed_inference_batch_size to harness config/defaults/validation.
Added CLI wiring so harness passes --embed-inference-batch-size through to runtime.
Bound embed params to prefer embed_inference_batch_size over embed_batch_size for model forward-pass batching.

2) Adaptive OOM stabilization in local embed runtime

On CUDA OOM-like errors, automatically reduce local embed microbatch and retry.
After successful streaks, recover batch size gradually toward target.
Emit warnings with char/token diagnostics for failing chunks.

3) Runtime parity + operational controls

Added Ray object-store sizing support via RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION / RAY_OBJECT_STORE_MEMORY_BYTES.
Forwarded runtime env/logging controls into Ray init path.

Quantitative Evidence (bo767)

All runs below use embed_inference_batch_size=32.

Run	gpu_embed	ingest PPS	recall@5	Artifact
bo767_20260406_193158_UTC	1.0	152.99	0.8345	`nemo_retriever/artifacts/bo767_20260406_193158_UTC/results.json`
bo767_20260406_202408_UTC	1.0	153.56	0.8365	`nemo_retriever/artifacts/bo767_20260406_202408_UTC/results.json`
bo767_20260406_204929_UTC	1.0	151.22	0.8325	`nemo_retriever/artifacts/bo767_20260406_204929_UTC/results.json`
bo767_20260408_154723_UTC	1.0	146.45	0.8315	`nemo_retriever/artifacts/bo767_20260408_154723_UTC/results.json`
bo767_20260408_160637_UTC	1.0	146.05	0.8365	`nemo_retriever/artifacts/bo767_20260408_160637_UTC/results.json`
bo767_20260408_172412_UTC	1.0	146.53	0.8335	`nemo_retriever/artifacts/bo767_20260408_172412_UTC/results.json`
bo767_20260408_161941_UTC	0.25	160.26	0.8365	`nemo_retriever/artifacts/bo767_20260408_161941_UTC/results.json`
bo767_20260408_173706_UTC	0.25	155.54	0.8355	`nemo_retriever/artifacts/bo767_20260408_173706_UTC/results.json`

Key Takeaways

Decoupled microbatch control (embed_inference_batch_size=32) is stable and repeatable across runs.
Recall stays in a tight band (~0.833–0.836 recall@5).
Throughput is strongly sensitive to gpu_embed policy in these validations:
- gpu_embed=1.0 cohort: ~146 PPS
- gpu_embed=0.25 cohort: >150 PPS
Adaptive OOM handling keeps runs progressing through long-tail batches instead of hard failing.

Recommendation

Merge this as the system-level OOM stabilization/control layer.
Keep default embed_inference_batch_size=32 in harness presets.
Treat gpu_embed as an explicit policy knob for follow-up performance decisioning (isolation vs throughput).

Why This Is Safe

Additive changes around tuning and runtime behavior; no API removals.
Retry logic activates only on OOM-like failures.
Conservative default microbatch (32) aligns with observed stability.
Harness/graph/resource tests pass.

When To Override

Increase embed_inference_batch_size only with sustained VRAM headroom and zero OOM retry pressure.
Decrease it for long-sequence-heavy corpora or increasing OOM retry frequency.
Tune gpu_embed according to target objective:
- stability isolation: closer to 1.0
- throughput: evaluate lower fractions (e.g., 0.25) with recall guardrails.

Rollout

Merge with embed_inference_batch_size=32 defaults.
Validate bo767/jp20 in nightly.
Track OOM retry frequency and recall@5 drift.
Finalize gpu_embed performance policy in a dedicated follow-up decision.

Rollback

Revert this commit to restore previous runtime behavior.
Immediate mitigation without code rollback: override embed_inference_batch_size and/or gpu_embed via harness CLI.

Open Questions For Lead Decision

Which gpu_embed policy should define productized performance baselines?
Should OOM retry events be informational telemetry only, or a gating signal for stability-required runs?

Test Plan

uv run pytest tests/test_harness_config.py tests/test_harness_run.py tests/test_resource_heuristics.py
uv run pytest tests/test_ingest_interface.py tests/test_graph_pipeline_registry.py
uv run pytest tests/test_create_local_embedder.py tests/test_multimodal_embed.py tests/test_operator_flags_and_cpu_actors.py
bo767 repeated harness validations with embed_inference_batch_size=32 (artifacts listed above)

Port embed OOM hardening and decoupled inference microbatch controls into the graph-era harness path so bo767 runs can be tuned for stability without conflating Ray scheduling batch size. Capture OOM outlier metadata for attribution, restore graph Ray init parity knobs, and document the refactor/validation context for lead review. Signed-off-by: jioffe502 <jioffe@nvidia.com>

copy-pr-bot · 2026-04-08T18:41:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-08T18:48:06Z

Greptile Summary

This PR decouples embedding transport batch size (embed_batch_size) from model inference microbatch size (embed_inference_batch_size), adds an adaptive CUDA OOM retry loop in the local embedder, and wires Ray object-store memory sizing through GraphIngestor. All remaining findings are P2 style/documentation items; the core runtime logic is additive and backward-compatible.

Confidence Score: 5/5

Safe to merge — changes are additive, retry logic activates only on OOM-like failures, and quantitative bo767 evidence supports stability.

All remaining findings are P2: a stale comment, missing YAML preset entries, an overly broad OOM-detection token, and a docstring gap. None block correctness or safety. Prior critical concerns (I/O in OOM handler, silent exception swallow, stale adaptive state, indexing landmine) were addressed in earlier review threads.

No files require special attention for merge safety; the embedder file carries the most new complexity but is well-tested by the bo767 run artifacts.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py	Core change: adds adaptive OOM retry loop in `_embed_local` with `_adaptive_batch_size` actor state, plus diagnostic helpers. `CUDNN_STATUS_INTERNAL_ERROR` in the OOM-detection set is broader than pure allocation failures (P2).
nemo_retriever/src/nemo_retriever/harness/config.py	Adds `embed_inference_batch_size: int = 32` to `HarnessConfig` and `TUNING_FIELDS`; validation and env-var wiring are consistent with the existing pattern.
nemo_retriever/src/nemo_retriever/harness/run.py	Correctly forwards `--embed-inference-batch-size` to the subprocess command; no issues found.
nemo_retriever/src/nemo_retriever/graph_ingestor.py	Adds `_resolve_object_store_memory_bytes()` for Ray object-store sizing and forwards `ray_log_to_driver`/`debug` into Ray init; new public constructor params are missing from the class docstring.
nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py	Exposes `--embed-inference-batch-size` CLI flag and wires it into `EmbedParams`; fallback to `embed_batch_size` preserves backward compatibility for callers that don't set the new flag.
nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py	No functional changes; `EMBED_BATCH_SIZE` comment now inaccurately describes the constant as driving both transport and inference batch sizing after the decoupling introduced in this PR.
nemo_retriever/harness/test_configs.yaml	Only two of eight presets carry `embed_inference_batch_size: 32`; the six VL presets fall back to the `HarnessConfig` dataclass default, making the YAML incomplete as self-documentation for operators tuning those presets.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["embed(texts, batch_size)"] --> B["_embed_local()"]
    B --> C["Sort texts by length (longest first)"]
    C --> D["current_bs = min(target_bs, _adaptive_batch_size)"]
    D --> E{"i < len(sorted_texts)?"}
    E -- No --> F["torch.cat(outs)"]
    F --> G["Reorder to original indices"]
    G --> H["Return CPU tensor"]
    E -- Yes --> I["chunk = texts[i : i + current_bs]"]
    I --> J["_embed_chunk(chunk)"]
    J -- Success --> K["outs.append(result)\ni += current_bs"]
    K --> L{"current_bs < target_bs?"}
    L -- Yes --> M["success_streak += 1"]
    M --> N{"streak >= 3?"}
    N -- Yes --> O["current_bs = min(target_bs, current_bs * 2)\n_adaptive_batch_size = current_bs"]
    N -- No --> E
    O --> E
    L -- No --> P["_adaptive_batch_size = target_bs"]
    P --> E
    J -- "torch.cuda.OutOfMemoryError\nor OOM-like RuntimeError" --> Q["torch.cuda.empty_cache()\nsuccess_streak = 0"]
    Q --> R{"current_bs <= 1?"}
    R -- Yes --> S["Re-raise"]
    R -- No --> T["current_bs = current_bs // 2\n_adaptive_batch_size = current_bs\nwarn with diagnostics"]
    T --> E

Prompt To Fix All With AI

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py
Line: 25

Comment:
**Stale comment after decoupling**

`EMBED_BATCH_SIZE` is now only the Ray Data transport/scheduling batch size; the model inference microbatch is controlled separately by `embed_inference_batch_size`. The trailing comment incorrectly implies this constant drives both knobs, which is the coupling this PR was designed to remove.

```suggestion
EMBED_BATCH_SIZE = 256  # Ray Data transport/scheduling batch size (inference microbatch is controlled separately)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py
Line: 24-34

Comment:
**`CUDNN_STATUS_INTERNAL_ERROR` is broader than OOM**

`CUDNN_STATUS_INTERNAL_ERROR` is a generic cuDNN failure covering workspace allocation failures but also unsupported operations, invalid tensor shapes, and hardware faults. Including it here means any unexpected cuDNN error (not just memory pressure) will silently trigger batch-halving retries. If the root cause is not VRAM pressure, each retry at a reduced batch size will also fail, burning through 7+ retry cycles before eventually re-raising — with misleading "retrying with batch_size=…" diagnostics. Consider tightening this to tokens that more specifically indicate allocation failures.

```suggestion
def _is_cuda_oom_like_error(exc: RuntimeError) -> bool:
    msg = str(exc).upper()
    return any(
        token in msg
        for token in (
            "CUDA OUT OF MEMORY",
            "CUBLAS_STATUS_ALLOC_FAILED",
            "CUDNN_STATUS_ALLOC_FAILED",
        )
    )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/harness/test_configs.yaml
Line: 77-118

Comment:
**`embed_inference_batch_size` missing from most presets**

Only `PE_GE_OCR_TE_DENSE` and `PE_GE_OCR_TE_HYBRID` carry the new key. The six VL presets (`PE_GE_OCR_VL_*`) rely on the `HarnessConfig` dataclass default (`32`), but an operator reading these YAML files to tune a VL run won't know that knob exists or what its effective value is. Adding `embed_inference_batch_size: 32` to all presets keeps the YAML self-documenting and makes future tuning explicit.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 172-199

Comment:
**New public constructor params missing from docstring**

`ray_log_to_driver` and `debug` were added as public constructor parameters in this PR but are not documented in the class docstring. Per the `docstrings-public-interface` standard, every public parameter should have a description entry.

Add to the class docstring Parameters section:

```
ray_log_to_driver
    Forward Ray worker logs to the driver process (default True).
debug
    Enable DEBUG-level logging in Ray runtime workers (default False).
```

**Rule Used:** Public modules, classes, and functions must have d... ([source](https://app.greptile.com/review/custom-context?memory=docstrings-public-interface))

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (5): Last reviewed commit: "Merge branch 'main' into feature/embed-o..." | Re-trigger Greptile}

greptile-apps · 2026-04-08T18:48:10Z

+        capture_path = Path(capture_path_raw).expanduser()
+        capture_is_dir = capture_path_raw.endswith("/") or (capture_path.exists() and capture_path.is_dir())
+        if capture_is_dir:
+            capture_path.mkdir(parents=True, exist_ok=True)
+            capture_file = capture_path / f"embed_oom_outliers_pid{os.getpid()}.jsonl"
+        else:
+            capture_path.parent.mkdir(parents=True, exist_ok=True)
+            capture_file = capture_path
+
+        line = json.dumps(event, ensure_ascii=True) + "\n"
+        with capture_file.open("a", encoding="utf-8") as f:


Unguarded I/O in _capture_oom_outlier_event can abort OOM retry

_capture_oom_outlier_event is called from inside except torch.cuda.OutOfMemoryError: and except RuntimeError: blocks where the intent is to halve the batch size and retry. If capture_path.mkdir() or capture_file.open(...) raises (disk full, permission denied, race on directory creation) the new I/O exception propagates out of the except-handler, replacing the OOM and killing the embedding job instead of retrying. The whole function must be wrapped in a top-level try/except so that any failure is silently dropped rather than breaking recovery.

try: capture_path = Path(capture_path_raw).expanduser() capture_is_dir = capture_path_raw.endswith("/") or (capture_path.exists() and capture_path.is_dir()) if capture_is_dir: capture_path.mkdir(parents=True, exist_ok=True) capture_file = capture_path / f"embed_oom_outliers_pid{os.getpid()}.jsonl" else: capture_path.parent.mkdir(parents=True, exist_ok=True) capture_file = capture_path line = json.dumps(event, ensure_ascii=True) + "\n" with capture_file.open("a", encoding="utf-8") as f: try: import fcntl fcntl.flock(f.fileno(), fcntl.LOCK_EX) f.write(line) f.flush() fcntl.flock(f.fileno(), fcntl.LOCK_UN) except Exception: f.write(line) f.flush() self._oom_capture_event_count += 1 except Exception as capture_exc: logger.debug("OOM capture write failed (non-fatal): %s", capture_exc)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py Line: 271-281 Comment: **Unguarded I/O in `_capture_oom_outlier_event` can abort OOM retry** `_capture_oom_outlier_event` is called from inside `except torch.cuda.OutOfMemoryError:` and `except RuntimeError:` blocks where the intent is to halve the batch size and retry. If `capture_path.mkdir()` or `capture_file.open(...)` raises (disk full, permission denied, race on directory creation) the new I/O exception propagates out of the except-handler, replacing the OOM and killing the embedding job instead of retrying. The whole function must be wrapped in a top-level `try/except` so that any failure is silently dropped rather than breaking recovery. ```python try: capture_path = Path(capture_path_raw).expanduser() capture_is_dir = capture_path_raw.endswith("/") or (capture_path.exists() and capture_path.is_dir()) if capture_is_dir: capture_path.mkdir(parents=True, exist_ok=True) capture_file = capture_path / f"embed_oom_outliers_pid{os.getpid()}.jsonl" else: capture_path.parent.mkdir(parents=True, exist_ok=True) capture_file = capture_path line = json.dumps(event, ensure_ascii=True) + "\n" with capture_file.open("a", encoding="utf-8") as f: try: import fcntl fcntl.flock(f.fileno(), fcntl.LOCK_EX) f.write(line) f.flush() fcntl.flock(f.fileno(), fcntl.LOCK_UN) except Exception: f.write(line) f.flush() self._oom_capture_event_count += 1 except Exception as capture_exc: logger.debug("OOM capture write failed (non-fatal): %s", capture_exc) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-08T18:48:17Z

+    try:
+        tokenized = tokenizer(
+            list(chunk),
+            padding=False,
+            truncation=True,
+            max_length=max(1, int(max_length)),
+            return_length=True,
+        )
+        token_lengths = tokenized.get("length")
+    except Exception:
+        token_lengths = None


Silent exception swallow in tokenization path

The bare except Exception: on the tokenizer call in _batch_length_summary swallows tokenization errors without logging. Since this helper is called during OOM recovery to build diagnostic information, a silent failure means the warning message emitted to the user will have tok_max=None, tok_p95=None with no indication that tokenization itself failed. Consider adding a logger.debug(...) at minimum so OOM diagnostic information gaps are traceable.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py Line: 83-93 Comment: **Silent exception swallow in tokenization path** The bare `except Exception:` on the tokenizer call in `_batch_length_summary` swallows tokenization errors without logging. Since this helper is called during OOM recovery to build diagnostic information, a silent failure means the warning message emitted to the user will have `tok_max=None, tok_p95=None` with no indication that tokenization itself failed. Consider adding a `logger.debug(...)` at minimum so OOM diagnostic information gaps are traceable. How can I resolve this? If you propose a fix, please make it concise.

Drop per-item metadata and JSONL OOM outlier capture paths from the local embed runtime while retaining adaptive OOM retry and decoupled inference batch controls. This keeps the PR focused on core stability behavior and reduces instrumentation overhead. Signed-off-by: jioffe502 <jioffe@nvidia.com>

greptile-apps · 2026-04-08T19:11:56Z


        outs: List[torch.Tensor] = []
+        target_bs = max(1, int(batch_size))
+        current_bs = min(target_bs, self._adaptive_batch_size) if self._adaptive_batch_size is not None else target_bs


Adaptive batch size reads stale state across calls

self._adaptive_batch_size is actor-level state that persists between embed() calls. On a fresh call with a new (larger) batch_size argument, line 224 initialises current_bs = min(target_bs, self._adaptive_batch_size), which silently caps the first batch at whatever the worst historical OOM level was, even if VRAM pressure has since been relieved. The streak-based growth inside the loop never propagates back across the call boundary. Consider resetting when target_bs exceeds the stored value, or document the cross-call persistence explicitly.

current_bs = ( min(target_bs, self._adaptive_batch_size) if self._adaptive_batch_size is not None and self._adaptive_batch_size < target_bs else target_bs )

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py Line: 224 Comment: **Adaptive batch size reads stale state across calls** `self._adaptive_batch_size` is actor-level state that persists between `embed()` calls. On a fresh call with a new (larger) `batch_size` argument, line 224 initialises `current_bs = min(target_bs, self._adaptive_batch_size)`, which silently caps the first batch at whatever the worst historical OOM level was, even if VRAM pressure has since been relieved. The streak-based growth inside the loop never propagates back across the call boundary. Consider resetting when `target_bs` exceeds the stored value, or document the cross-call persistence explicitly. ```python current_bs = ( min(target_bs, self._adaptive_batch_size) if self._adaptive_batch_size is not None and self._adaptive_batch_size < target_bs else target_bs ) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-08T19:11:57Z

+        if not outs:
+            return torch.empty((0, 0), dtype=torch.float32)
+
+        sorted_embeddings = torch.cat(outs, dim=0)
+        reordered_embeddings: List[torch.Tensor | None] = [None] * len(sorted_to_original)
+        for sorted_idx, original_idx in enumerate(sorted_to_original):
+            reordered_embeddings[original_idx] = sorted_embeddings[sorted_idx]
+        if any(emb is None for emb in reordered_embeddings):
+            raise RuntimeError("Failed to reconstruct embedding order after length sorting.")
+        return torch.stack([emb for emb in reordered_embeddings if emb is not None], dim=0)


Length mismatch between sorted_embeddings and sorted_to_original when OOM leaves partial results

If OOM is raised when current_bs <= 1 (re-raised, not retried), execution exits the while loop with outs containing embeddings for only the texts processed before the fatal OOM. torch.cat(outs) produces a tensor shorter than len(sorted_to_original), and the loop for sorted_idx, original_idx in enumerate(sorted_to_original) will raise IndexError when sorted_idx exceeds that length. The if any(emb is None …) guard never executes because the IndexError fires first. Currently unreachable because the fatal re-raise exits before this code, but it is a correctness landmine for future refactors. Add an explicit length pre-check before the indexing loop.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/model/local/llama_nemotron_embed_1b_v2_embedder.py Line: 284-293 Comment: **Length mismatch between `sorted_embeddings` and `sorted_to_original` when OOM leaves partial results** If OOM is raised when `current_bs <= 1` (re-raised, not retried), execution exits the while loop with `outs` containing embeddings for only the texts processed before the fatal OOM. `torch.cat(outs)` produces a tensor shorter than `len(sorted_to_original)`, and the loop `for sorted_idx, original_idx in enumerate(sorted_to_original)` will raise `IndexError` when `sorted_idx` exceeds that length. The `if any(emb is None …)` guard never executes because the IndexError fires first. Currently unreachable because the fatal re-raise exits before this code, but it is a correctness landmine for future refactors. Add an explicit length pre-check before the indexing loop. How can I resolve this? If you propose a fix, please make it concise.

jioffe502 · 2026-04-09T13:53:32Z

DGX bo767 sweep evidence

Ran the dedicated 6-run dgx_8gpu matrix on bo767 with:

embed_inference_batch_size in {32,64,96}
gpu_embed in {0.25,1.0}
fixed embed_batch_size=256, use_heuristics=false

Observed results:

`embed_inference_batch_size`	`gpu_embed=0.25`	`gpu_embed=1.0`	Delta vs `1.0`
`32`	`158.68 PPS`, `recall@5=0.8355`	`147.72 PPS`, `recall@5=0.8335`	`+10.96 PPS` (`+7.42%`)
`64`	`162.26 PPS`, `recall@5=0.8355`	`141.94 PPS`, `recall@5=0.8355`	`+20.32 PPS` (`+14.32%`)
`96`	`158.00 PPS`, `recall@5=0.8355`	`135.93 PPS`, `recall@5=0.8375`	`+22.07 PPS` (`+16.24%`)

gpu_embed=0.25 beat gpu_embed=1.0 at every batch size with effectively flat recall@5.
Best observed throughput was mb64,gpu_embed=0.25 at 162.26 PPS.
mb64,gpu=0.25 is only 2.26% faster than mb32,gpu=0.25, so mb32 remains a plausible conservative fallback.
mb96 should not be the default. It is slower than mb64 under both GPU policies and still showed OOM-retry behavior in the finished session output.
Historical gpu_embed=1.0 controls show the same shape: mb32=152.99, mb64=145.45, mb96=145.64 PPS.

Recommendation:

Default gpu_embed=0.25.
If OOM retries are telemetry-only, default embed_inference_batch_size=64.
If we want a more conservative default, use embed_inference_batch_size=32 and document 64 as the throughput override.
Do not default 96.

jioffe502 · 2026-04-10T16:51:18Z

bo767 repeat sweep follow-up at `gpu_embed=0.25`

Ran a 40-run DGX bo767 sweep with gpu_embed=0.25 fixed and embed_inference_batch_size in {1,32,64,256}, repeated 10x per setting.

`embed_inference_batch_size`	Passes	Mean PPS	PPS stdev	PPS range	Mean `recall@5`	`recall@5` stdev	OOM evidence
`1`	`10/10`	`152.82`	`1.38`	`151.46-156.59`	`0.8355`	`0.0006`	none
`32`	`10/10`	`159.34`	`1.40`	`156.94-162.11`	`0.8354`	`0.0014`	none
`64`	`10/10`	`158.64`	`1.67`	`156.68-163.34`	`0.8309`	`0.0108`	none
`256`	`9/10`	`158.51`*	`1.59`*	`156.12-161.49`*	`0.8124`*	`0.0246`*	`1` failed run, `8` OOM retries in the failed run

* computed over the 9 successful mb=256 runs only.

Takeaways:

32 is the best default candidate in this repeat sweep: highest mean PPS, no OOM retries, and stable recall.
64 is within observed run-to-run noise on ingest (-0.70 PPS, -0.44% vs 32) and shows worse recall stability, so this does not support moving the default from 32 to 64.
1 is clearly slower (-6.52 PPS, -4.09% vs 32) with no stability upside.
256 is not a viable default candidate: one run failed, the failed run logged 8 CUDA OOM during embedding; retrying events, and the successful runs still showed materially worse recall stability.
This is directionally consistent with the current HF/local embedding path not realizing reliable end-to-end gains from larger microbatches in this workload.
The VDB tail is already de-bottlenecked (VDBUploadOperator batch_size=64), so pushing embed microbatch higher is not buying a downstream write-path win here; in practice it mainly adds memory pressure and instability.

Recommendation / next step:

If we want one default from the current evidence, this repeat sweep supports embed_inference_batch_size=32 at gpu_embed=0.25.
From here, I think there are two reasonable paths:
1. Land this change with 32 as the default and treat broader SKU validation as follow-up work.
2. Keep testing in this PR, but narrow it to cross-SKU validation focused on 32 vs 64 rather than continuing to probe larger batch sizes.

jioffe502 requested review from a team as code owners April 8, 2026 18:41

jioffe502 requested a review from jperez999 April 8, 2026 18:41

jioffe502 changed the title ~~harness: Stabilize graph embed OOM tuning and defaults~~ harness: stabilize embed OOM tuning and defaults Apr 8, 2026

greptile-apps Bot reviewed Apr 8, 2026

View reviewed changes

jioffe502 marked this pull request as draft April 8, 2026 19:25

jioffe502 marked this pull request as ready for review April 10, 2026 20:52

jdye64 added 2 commits April 17, 2026 15:07

Merge branch 'main' into feature/embed-oom-stability-graph-port

13c3750

Merge branch 'main' into feature/embed-oom-stability-graph-port

5e3753a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

harness: stabilize embed OOM tuning and defaults#1823

harness: stabilize embed OOM tuning and defaults#1823
jioffe502 wants to merge 4 commits intoNVIDIA:mainfrom
jioffe502:feature/embed-oom-stability-graph-port

jioffe502 commented Apr 8, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Apr 8, 2026

Uh oh!

greptile-apps Bot commented Apr 8, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 8, 2026

Uh oh!

greptile-apps Bot Apr 8, 2026

Uh oh!

greptile-apps Bot Apr 8, 2026

Uh oh!

greptile-apps Bot Apr 8, 2026

Uh oh!

jioffe502 commented Apr 9, 2026 •

edited

Loading

Uh oh!

jioffe502 commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jioffe502 commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TLDR

Problem

What Changed

1) Decoupled controls in harness/CLI

2) Adaptive OOM stabilization in local embed runtime

3) Runtime parity + operational controls

Quantitative Evidence (bo767)

Key Takeaways

Recommendation

Why This Is Safe

When To Override

Rollout

Rollback

Open Questions For Lead Decision

Test Plan

Uh oh!

copy-pr-bot Bot commented Apr 8, 2026

Uh oh!

greptile-apps Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

jioffe502 commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

DGX bo767 sweep evidence

Uh oh!

jioffe502 commented Apr 10, 2026

bo767 repeat sweep follow-up at gpu_embed=0.25

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jioffe502 commented Apr 8, 2026 •

edited

Loading

greptile-apps Bot commented Apr 8, 2026 •

edited

Loading

jioffe502 commented Apr 9, 2026 •

edited

Loading

bo767 repeat sweep follow-up at `gpu_embed=0.25`