added pipeline sub command and update docs#1874
added pipeline sub command and update docs#1874jperez999 wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR promotes
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/pipeline/main.py | New Typer CLI module for retriever pipeline run; well-structured with good helper decomposition; ValueError catch in _ensure_lancedb_table is narrowed correctly; run() command is ~300 lines (above the 50-line SRP guidance) but is the CLI entry point |
| nemo_retriever/src/nemo_retriever/pipeline/init.py | Minimal package init; lazy attribute forwarding for app/run is clean and avoids circular imports |
| nemo_retriever/src/nemo_retriever/adapters/cli/main.py | Adds pipeline sub-app by importing __main__ directly (consistent with existing html, online, txt patterns); safe change |
| nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py | Thin backward-compat shim re-exporting app from the new pipeline.__main__; clean migration approach |
| nemo_retriever/tests/test_pipeline_helpers.py | Good unit test coverage of all helper functions; tests use spec-safe fakes, parametrize edge cases, and cover happy paths + error paths |
| nemo_retriever/docs/cli/retriever_cli.md | Uses non-existent --pdf-split-batch flag in multiple examples (should be --pdf-split-batch-size); also contains previously-flagged incorrect module path |
| nemo_retriever/docs/cli/pdf-split-tuning.md | Treats --pdf-split-batch and --pdf-split-batch-size as two distinct flags (line 14), but only --pdf-split-batch-size exists; all --pdf-split-batch references need to be corrected |
| nemo_retriever/docs/cli/README.md | New README mapping old nv-ingest-cli examples to retriever subcommands; content is accurate and well-organized |
| nemo_retriever/docs/cli/quickstart.md | References non-existent --pdf-split-batch flag on line 81; otherwise accurate |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[retriever pipeline run INPUT_PATH] --> B[_configure_logging]
B --> C[_resolve_file_patterns]
C --> D{input_type?}
D -->|pdf/doc| E[_build_extract_params]
D -->|txt/html| F[TextChunkParams]
D -->|image| G[ExtractParams for images]
D -->|audio| H[AudioChunkParams + asr_params]
E --> I[_build_embed_params]
F --> I
G --> I
H --> I
I --> J[_build_ingestor GraphIngestor + stages]
J --> K[ingestor.ingest]
K --> L[_collect_results batch: take_all / inprocess: DataFrame]
L --> M{save_intermediate?}
M -->|yes| N[Write Parquet]
M -->|no| O[handle_lancedb]
N --> O
O --> P{table empty?}
P -->|yes| Q[_write_runtime_summary no eval]
P -->|no| R[_run_evaluation recall or BEIR]
R --> S[_write_runtime_summary]
S --> T[print_run_summary]
Q --> U[ray.shutdown if batch]
T --> U
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/docs/cli/retriever_cli.md
Line: 152-156
Comment:
**`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`**
The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`.
```suggestion
--extract-text --no-extract-tables --no-extract-charts \
--pdf-split-batch-size 64 \
--save-intermediate ./processed_docs
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (2): Last reviewed commit: "fix exception on table open" | Re-trigger Greptile
| return | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Silent swallow of all
open_table exceptions
except Exception: pass catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to db.create_table(). If open_table fails for any reason other than "table doesn't exist", the subsequent create_table call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead.
| return | |
| except Exception: | |
| pass | |
| try: | |
| db.open_table(table_name) | |
| return | |
| except Exception as e: | |
| if "not found" not in str(e).lower(): | |
| raise |
Or, if lancedb exposes a typed exception (e.g. lancedb.exceptions.TableNotFoundError), use that directly.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 152-154
Comment:
**Silent swallow of all `open_table` exceptions**
`except Exception: pass` catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to `db.create_table()`. If `open_table` fails for any reason other than "table doesn't exist", the subsequent `create_table` call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead.
```suggestion
try:
db.open_table(table_name)
return
except Exception as e:
if "not found" not in str(e).lower():
raise
```
Or, if lancedb exposes a typed exception (e.g. `lancedb.exceptions.TableNotFoundError`), use that directly.
How can I resolve this? If you propose a fix, please make it concise.| def _resolve_file_patterns(input_path: Path, input_type: str) -> list[str]: | ||
| input_path = Path(input_path) | ||
| if input_path.is_file(): | ||
| return [str(input_path)] | ||
| if not input_path.is_dir(): | ||
| raise typer.BadParameter(f"Path does not exist: {input_path}") | ||
|
|
||
| ext_map = { | ||
| "pdf": ["*.pdf"], | ||
| "doc": ["*.docx", "*.pptx"], | ||
| "txt": ["*.txt"], | ||
| "html": ["*.html"], | ||
| "image": ["*.jpg", "*.jpeg", "*.png", "*.tiff", "*.bmp"], | ||
| "audio": ["*.mp3", "*.wav", "*.m4a"], | ||
| } | ||
| exts = ext_map.get(input_type) | ||
| if exts is None: | ||
| raise typer.BadParameter(f"Unsupported --input-type: {input_type!r}") | ||
|
|
||
| patterns = [str(input_path / ext) for ext in exts] | ||
| matched = [p for p in patterns if _glob.glob(p)] | ||
| if not matched: | ||
| raise typer.BadParameter(f"No files found for input_type={input_type!r} in {input_path}") | ||
| return matched |
There was a problem hiding this comment.
Non-recursive directory glob may surprise users
str(input_path / ext) produces patterns like /data/pdfs/*.pdf, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the BadParameter is only raised when the top level has zero matches).
Consider using **/*.pdf with recursive=True, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 183-206
Comment:
**Non-recursive directory glob may surprise users**
`str(input_path / ext)` produces patterns like `/data/pdfs/*.pdf`, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the `BadParameter` is only raised when the top level has zero matches).
Consider using `**/*.pdf` with `recursive=True`, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up.
How can I resolve this? If you propose a fix, please make it concise.| --pdf-split-batch 64 \ | ||
| --save-intermediate ./processed_docs | ||
| ``` | ||
|
|
||
| **Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch` |
There was a problem hiding this comment.
--pdf-split-batch does not exist — should be --pdf-split-batch-size
The flag name --pdf-split-batch used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in pipeline/__main__.py (line 711: "--pdf-split-batch-size"). Running the example as-written will produce Error: No such option: --pdf-split-batch. The same mismatch appears in quickstart.md (line 81), pdf-split-tuning.md (lines 14, 37, 41, 43, 53, 64), and cli-client-usage.md (line 102). All occurrences need to be replaced with --pdf-split-batch-size.
| --pdf-split-batch 64 \ | |
| --save-intermediate ./processed_docs | |
| ``` | |
| **Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch` | |
| --extract-text --no-extract-tables --no-extract-charts \ | |
| --pdf-split-batch-size 64 \ | |
| --save-intermediate ./processed_docs |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/docs/cli/retriever_cli.md
Line: 152-156
Comment:
**`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`**
The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`.
```suggestion
--extract-text --no-extract-tables --no-extract-charts \
--pdf-split-batch-size 64 \
--save-intermediate ./processed_docs
```
How can I resolve this? If you propose a fix, please make it concise.
Description
This PR turns the graph_pipeline.py file into retriever subcommand pipeline. Now you can you
retriever pipelineto run the exact same logic as graph_pipeline.pyChecklist