Skip to content

added pipeline sub command and update docs#1874

Open
jperez999 wants to merge 3 commits intoNVIDIA:mainfrom
jperez999:docs-replace
Open

added pipeline sub command and update docs#1874
jperez999 wants to merge 3 commits intoNVIDIA:mainfrom
jperez999:docs-replace

Conversation

@jperez999
Copy link
Copy Markdown
Collaborator

Description

This PR turns the graph_pipeline.py file into retriever subcommand pipeline. Now you can you retriever pipeline to run the exact same logic as graph_pipeline.py

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@jperez999 jperez999 self-assigned this Apr 17, 2026
@jperez999 jperez999 requested review from a team as code owners April 17, 2026 20:38
@jperez999 jperez999 requested a review from jdye64 April 17, 2026 20:38
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR promotes graph_pipeline.py into a proper retriever pipeline subcommand by introducing nemo_retriever/pipeline/__main__.py (the Typer app) and __init__.py (lazy re-export), converting the original script into a thin backward-compat shim, and adding a suite of CLI docs under nemo_retriever/docs/cli/. The source code is well-structured, the _ensure_lancedb_table exception narrowing is correct, and the unit tests cover all helpers thoroughly.

  • P1 — broken flag name in docs: Multiple doc files (retriever_cli.md, pdf-split-tuning.md, quickstart.md, cli-client-usage.md) reference --pdf-split-batch, which is not a registered CLI option — the actual flag is --pdf-split-batch-size (line 711 of __main__.py). Users following any of these examples will get Error: No such option: --pdf-split-batch immediately.

Confidence Score: 4/5

Safe to merge once the --pdf-split-batch--pdf-split-batch-size documentation fix is applied across all affected files.

One P1 finding: the --pdf-split-batch flag name used in documentation examples across four files does not exist in the CLI — only --pdf-split-batch-size does. Any user running the documented commands verbatim will get a Typer error. All other findings are resolved or P2.

nemo_retriever/docs/cli/retriever_cli.md, nemo_retriever/docs/cli/pdf-split-tuning.md, nemo_retriever/docs/cli/quickstart.md, nemo_retriever/docs/cli/cli-client-usage.md — all need --pdf-split-batch replaced with --pdf-split-batch-size.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/pipeline/main.py New Typer CLI module for retriever pipeline run; well-structured with good helper decomposition; ValueError catch in _ensure_lancedb_table is narrowed correctly; run() command is ~300 lines (above the 50-line SRP guidance) but is the CLI entry point
nemo_retriever/src/nemo_retriever/pipeline/init.py Minimal package init; lazy attribute forwarding for app/run is clean and avoids circular imports
nemo_retriever/src/nemo_retriever/adapters/cli/main.py Adds pipeline sub-app by importing __main__ directly (consistent with existing html, online, txt patterns); safe change
nemo_retriever/src/nemo_retriever/examples/graph_pipeline.py Thin backward-compat shim re-exporting app from the new pipeline.__main__; clean migration approach
nemo_retriever/tests/test_pipeline_helpers.py Good unit test coverage of all helper functions; tests use spec-safe fakes, parametrize edge cases, and cover happy paths + error paths
nemo_retriever/docs/cli/retriever_cli.md Uses non-existent --pdf-split-batch flag in multiple examples (should be --pdf-split-batch-size); also contains previously-flagged incorrect module path
nemo_retriever/docs/cli/pdf-split-tuning.md Treats --pdf-split-batch and --pdf-split-batch-size as two distinct flags (line 14), but only --pdf-split-batch-size exists; all --pdf-split-batch references need to be corrected
nemo_retriever/docs/cli/README.md New README mapping old nv-ingest-cli examples to retriever subcommands; content is accurate and well-organized
nemo_retriever/docs/cli/quickstart.md References non-existent --pdf-split-batch flag on line 81; otherwise accurate

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[retriever pipeline run INPUT_PATH] --> B[_configure_logging]
    B --> C[_resolve_file_patterns]
    C --> D{input_type?}
    D -->|pdf/doc| E[_build_extract_params]
    D -->|txt/html| F[TextChunkParams]
    D -->|image| G[ExtractParams for images]
    D -->|audio| H[AudioChunkParams + asr_params]
    E --> I[_build_embed_params]
    F --> I
    G --> I
    H --> I
    I --> J[_build_ingestor GraphIngestor + stages]
    J --> K[ingestor.ingest]
    K --> L[_collect_results batch: take_all / inprocess: DataFrame]
    L --> M{save_intermediate?}
    M -->|yes| N[Write Parquet]
    M -->|no| O[handle_lancedb]
    N --> O
    O --> P{table empty?}
    P -->|yes| Q[_write_runtime_summary no eval]
    P -->|no| R[_run_evaluation recall or BEIR]
    R --> S[_write_runtime_summary]
    S --> T[print_run_summary]
    Q --> U[ray.shutdown if batch]
    T --> U
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/docs/cli/retriever_cli.md
Line: 152-156

Comment:
**`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`**

The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`.

```suggestion
  --extract-text --no-extract-tables --no-extract-charts \
  --pdf-split-batch-size 64 \
  --save-intermediate ./processed_docs
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix exception on table open" | Re-trigger Greptile

Comment on lines +152 to +154
return
except Exception:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent swallow of all open_table exceptions

except Exception: pass catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to db.create_table(). If open_table fails for any reason other than "table doesn't exist", the subsequent create_table call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead.

Suggested change
return
except Exception:
pass
try:
db.open_table(table_name)
return
except Exception as e:
if "not found" not in str(e).lower():
raise

Or, if lancedb exposes a typed exception (e.g. lancedb.exceptions.TableNotFoundError), use that directly.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 152-154

Comment:
**Silent swallow of all `open_table` exceptions**

`except Exception: pass` catches every possible failure — including permissions errors, corrupted database state, network issues with remote LanceDB — and silently falls through to `db.create_table()`. If `open_table` fails for any reason other than "table doesn't exist", the subsequent `create_table` call will likely fail too, but with a confusing error that hides the root cause. Catch the specific lancedb exception for a missing table instead.

```suggestion
    try:
        db.open_table(table_name)
        return
    except Exception as e:
        if "not found" not in str(e).lower():
            raise
```

Or, if lancedb exposes a typed exception (e.g. `lancedb.exceptions.TableNotFoundError`), use that directly.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Comment thread nemo_retriever/docs/cli/retriever_cli.md
Comment on lines +183 to +206
def _resolve_file_patterns(input_path: Path, input_type: str) -> list[str]:
input_path = Path(input_path)
if input_path.is_file():
return [str(input_path)]
if not input_path.is_dir():
raise typer.BadParameter(f"Path does not exist: {input_path}")

ext_map = {
"pdf": ["*.pdf"],
"doc": ["*.docx", "*.pptx"],
"txt": ["*.txt"],
"html": ["*.html"],
"image": ["*.jpg", "*.jpeg", "*.png", "*.tiff", "*.bmp"],
"audio": ["*.mp3", "*.wav", "*.m4a"],
}
exts = ext_map.get(input_type)
if exts is None:
raise typer.BadParameter(f"Unsupported --input-type: {input_type!r}")

patterns = [str(input_path / ext) for ext in exts]
matched = [p for p in patterns if _glob.glob(p)]
if not matched:
raise typer.BadParameter(f"No files found for input_type={input_type!r} in {input_path}")
return matched
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-recursive directory glob may surprise users

str(input_path / ext) produces patterns like /data/pdfs/*.pdf, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the BadParameter is only raised when the top level has zero matches).

Consider using **/*.pdf with recursive=True, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 183-206

Comment:
**Non-recursive directory glob may surprise users**

`str(input_path / ext)` produces patterns like `/data/pdfs/*.pdf`, which only match files directly in the specified directory — not in subdirectories. If a user passes a nested corpus directory, subdirectory files will be silently skipped with no warning (the `BadParameter` is only raised when the top level has zero matches).

Consider using `**/*.pdf` with `recursive=True`, or at minimum log the resolved patterns at DEBUG level so users can verify what was picked up.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +152 to +156
--pdf-split-batch 64 \
--save-intermediate ./processed_docs
```

**Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 --pdf-split-batch does not exist — should be --pdf-split-batch-size

The flag name --pdf-split-batch used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in pipeline/__main__.py (line 711: "--pdf-split-batch-size"). Running the example as-written will produce Error: No such option: --pdf-split-batch. The same mismatch appears in quickstart.md (line 81), pdf-split-tuning.md (lines 14, 37, 41, 43, 53, 64), and cli-client-usage.md (line 102). All occurrences need to be replaced with --pdf-split-batch-size.

Suggested change
--pdf-split-batch 64 \
--save-intermediate ./processed_docs
```
**Parity note.** There is no v1/v2 concept in `retriever`; `--pdf-split-batch`
--extract-text --no-extract-tables --no-extract-charts \
--pdf-split-batch-size 64 \
--save-intermediate ./processed_docs
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/docs/cli/retriever_cli.md
Line: 152-156

Comment:
**`--pdf-split-batch` does not exist — should be `--pdf-split-batch-size`**

The flag name `--pdf-split-batch` used in this code block and in the surrounding prose (lines 80, 156) does not match the option registered in `pipeline/__main__.py` (line 711: `"--pdf-split-batch-size"`). Running the example as-written will produce `Error: No such option: --pdf-split-batch`. The same mismatch appears in `quickstart.md` (line 81), `pdf-split-tuning.md` (lines 14, 37, 41, 43, 53, 64), and `cli-client-usage.md` (line 102). All occurrences need to be replaced with `--pdf-split-batch-size`.

```suggestion
  --extract-text --no-extract-tables --no-extract-charts \
  --pdf-split-batch-size 64 \
  --save-intermediate ./processed_docs
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant