added pipeline sub command and update docs by jperez999 · Pull Request #1874 · NVIDIA/NeMo-Retriever

jperez999 · 2026-04-17T20:38:41Z

Description

This PR turns the graph_pipeline.py file into retriever subcommand pipeline. Now you can you retriever pipeline to run the exact same logic as graph_pipeline.py

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-04-17T20:42:11Z

Greptile Summary

This PR promotes graph_pipeline.py into a proper retriever pipeline subcommand by introducing nemo_retriever/pipeline/__main__.py (the Typer app) and __init__.py (lazy re-export), converting the original script into a thin backward-compat shim, and adding a suite of CLI docs under nemo_retriever/docs/cli/. The source code is well-structured, the _ensure_lancedb_table exception narrowing is correct, and the unit tests cover all helpers thoroughly.

P1 — broken flag name in docs: Multiple doc files (retriever_cli.md, pdf-split-tuning.md, quickstart.md, cli-client-usage.md) reference --pdf-split-batch, which is not a registered CLI option — the actual flag is --pdf-split-batch-size (line 711 of __main__.py). Users following any of these examples will get Error: No such option: --pdf-split-batch immediately.

Confidence Score: 4/5

Safe to merge once the --pdf-split-batch → --pdf-split-batch-size documentation fix is applied across all affected files.

One P1 finding: the --pdf-split-batch flag name used in documentation examples across four files does not exist in the CLI — only --pdf-split-batch-size does. Any user running the documented commands verbatim will get a Typer error. All other findings are resolved or P2.

nemo_retriever/docs/cli/retriever_cli.md, nemo_retriever/docs/cli/pdf-split-tuning.md, nemo_retriever/docs/cli/quickstart.md, nemo_retriever/docs/cli/cli-client-usage.md — all need --pdf-split-batch replaced with --pdf-split-batch-size.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/pipeline/main.py	New Typer CLI module for `retriever pipeline run`; well-structured with good helper decomposition; `ValueError` catch in `_ensure_lancedb_table` is narrowed correctly; `run()` command is ~300 lines (above the 50-line SRP guidance) but is the CLI entry point
nemo_retriever/src/nemo_retriever/pipeline/init.py	Minimal package init; lazy attribute forwarding for `app`/`run` is clean and avoids circular imports
nemo_retriever/src/nemo_retriever/adapters/cli/main.py	Adds `pipeline` sub-app by importing `__main__` directly (consistent with existing `html`, `online`, `txt` patterns); safe change
nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py	Thin backward-compat shim re-exporting `app` from the new `pipeline.__main__`; clean migration approach
nemo_retriever/tests/test_pipeline_helpers.py	Good unit test coverage of all helper functions; tests use spec-safe fakes, parametrize edge cases, and cover happy paths + error paths
nemo_retriever/docs/cli/retriever_cli.md	Uses non-existent `--pdf-split-batch` flag in multiple examples (should be `--pdf-split-batch-size`); also contains previously-flagged incorrect module path
nemo_retriever/docs/cli/pdf-split-tuning.md	Treats `--pdf-split-batch` and `--pdf-split-batch-size` as two distinct flags (line 14), but only `--pdf-split-batch-size` exists; all `--pdf-split-batch` references need to be corrected
nemo_retriever/docs/cli/README.md	New README mapping old `nv-ingest-cli` examples to `retriever` subcommands; content is accurate and well-organized
nemo_retriever/docs/cli/quickstart.md	References non-existent `--pdf-split-batch` flag on line 81; otherwise accurate

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[retriever pipeline run INPUT_PATH] --> B[_configure_logging]
    B --> C[_resolve_file_patterns]
    C --> D{input_type?}
    D -->|pdf/doc| E[_build_extract_params]
    D -->|txt/html| F[TextChunkParams]
    D -->|image| G[ExtractParams for images]
    D -->|audio| H[AudioChunkParams + asr_params]
    E --> I[_build_embed_params]
    F --> I
    G --> I
    H --> I
    I --> J[_build_ingestor GraphIngestor + stages]
    J --> K[ingestor.ingest]
    K --> L[_collect_results batch: take_all / inprocess: DataFrame]
    L --> M{save_intermediate?}
    M -->|yes| N[Write Parquet]
    M -->|no| O[handle_lancedb]
    N --> O
    O --> P{table empty?}
    P -->|yes| Q[_write_runtime_summary no eval]
    P -->|no| R[_run_evaluation recall or BEIR]
    R --> S[_write_runtime_summary]
    S --> T[print_run_summary]
    Q --> U[ray.shutdown if batch]
    T --> U

Prompt To Fix All With AI

This is a comment left during a code review.
Path: nemo_retriever/docs/cli/retriever_cli.md
Line: 152-156

Comment:
**`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`**

The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`.

```suggestion
  --extract-text --no-extract-tables --no-extract-charts \
  --pdf-split-batch-size 64 \
  --save-intermediate ./processed_docs
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "fix exception on table open" | Re-trigger Greptile}

greptile-apps · 2026-04-17T20:42:15Z

+        return
+    except Exception:
+        pass


Silent swallow of all open_table exceptions

except Exception: pass catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to db.create_table(). If open_table fails for any reason other than "table doesn't exist", the subsequent create_table call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead.

Suggested change

return

except Exception:

pass

try:

db.open_table(table_name)

return

except Exception as e:

if "not found" not in str(e).lower():

raise

Or, if lancedb exposes a typed exception (e.g. lancedb.exceptions.TableNotFoundError), use that directly.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py Line: 152-154 Comment: **Silent swallow of all `open_table` exceptions** `except Exception: pass` catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to `db.create_table()`. If `open_table` fails for any reason other than "table doesn't exist", the subsequent `create_table` call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead. ```suggestion try: db.open_table(table_name) return except Exception as e: if "not found" not in str(e).lower(): raise ``` Or, if lancedb exposes a typed exception (e.g. `lancedb.exceptions.TableNotFoundError`), use that directly. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-17T20:42:18Z

+def _resolve_file_patterns(input_path: Path, input_type: str) -> list[str]:
+    input_path = Path(input_path)
+    if input_path.is_file():
+        return [str(input_path)]
+    if not input_path.is_dir():
+        raise typer.BadParameter(f"Path does not exist: {input_path}")
+
+    ext_map = {
+        "pdf": ["*.pdf"],
+        "doc": ["*.docx", "*.pptx"],
+        "txt": ["*.txt"],
+        "html": ["*.html"],
+        "image": ["*.jpg", "*.jpeg", "*.png", "*.tiff", "*.bmp"],
+        "audio": ["*.mp3", "*.wav", "*.m4a"],
+    }
+    exts = ext_map.get(input_type)
+    if exts is None:
+        raise typer.BadParameter(f"Unsupported --input-type: {input_type!r}")
+
+    patterns = [str(input_path / ext) for ext in exts]
+    matched = [p for p in patterns if _glob.glob(p)]
+    if not matched:
+        raise typer.BadParameter(f"No files found for input_type={input_type!r} in {input_path}")
+    return matched


Non-recursive directory glob may surprise users

str(input_path / ext) produces patterns like /data/pdfs/*.pdf, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the BadParameter is only raised when the top level has zero matches).

Consider using **/*.pdf with recursive=True, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py Line: 183-206 Comment: **Non-recursive directory glob may surprise users** `str(input_path / ext)` produces patterns like `/data/pdfs/*.pdf`, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the `BadParameter` is only raised when the top level has zero matches). Consider using `**/*.pdf` with `recursive=True`, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-20T19:58:08Z

+  --pdf-split-batch 64 \
+  --save-intermediate ./processed_docs
+```
+
+**Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch`


--pdf-split-batch does not exist — should be --pdf-split-batch-size

The flag name --pdf-split-batch used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in pipeline/__main__.py (line 711: "--pdf-split-batch-size"). Running the example as-written will produce Error: No such option: --pdf-split-batch. The same mismatch appears in quickstart.md (line 81), pdf-split-tuning.md (lines 14, 37, 41, 43, 53, 64), and cli-client-usage.md (line 102). All occurrences need to be replaced with --pdf-split-batch-size.

Suggested change

--pdf-split-batch 64 \

--save-intermediate ./processed_docs

```

**Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch`

--extract-text --no-extract-tables --no-extract-charts \

--pdf-split-batch-size 64 \

--save-intermediate ./processed_docs

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/docs/cli/retriever_cli.md Line: 152-156 Comment: **`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`** The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`. ```suggestion --extract-text --no-extract-tables --no-extract-charts \ --pdf-split-batch-size 64 \ --save-intermediate ./processed_docs ``` How can I resolve this? If you propose a fix, please make it concise.

added pipeline sub command and update docs

839cb93

jperez999 self-assigned this Apr 17, 2026

jperez999 requested review from a team as code owners April 17, 2026 20:38

jperez999 requested a review from jdye64 April 17, 2026 20:38

greptile-apps Bot reviewed Apr 17, 2026

View reviewed changes

jperez999 added 2 commits April 20, 2026 12:54

merge in main

1523e27

fix exception on table open

6a1ed14

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added pipeline sub command and update docs#1874

added pipeline sub command and update docs#1874
jperez999 wants to merge 3 commits intoNVIDIA:mainfrom
jperez999:docs-replace

jperez999 commented Apr 17, 2026

Uh oh!

greptile-apps Bot commented Apr 17, 2026 •

edited

Loading

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 17, 2026

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Apr 17, 2026

Uh oh!

greptile-apps Bot Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-        return
-    except Exception:
-        pass
+    try:
+        db.open_table(table_name)
+        return
+    except Exception as e:
+        if "not found" not in str(e).lower():
+            raise

Conversation

jperez999 commented Apr 17, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Apr 17, 2026 •

edited

Loading