74 changes: 74 additions & 0 deletions nemo_retriever/docs/cli/README.md
# Retriever CLI — Replacement Examples for `nv-ingest-cli`

This folder contains `retriever` command-line examples that deliver the same
end-user outcomes as the `nv-ingest-cli` examples in
`nv-ingest/docs/`, `nv-ingest/api/`, `nv-ingest/client/`, and `nv-ingest/deploy/`.

The original `nv-ingest-cli` documentation is **not removed** — these files sit
alongside it as a new-CLI counterpart you can link to or migrate to.

## Key shape difference

`nv-ingest-cli` is a **single command that talks to a running REST service on
`localhost:7670`** and composes work via repeated `--task extract|split|caption|embed|dedup|filter|udf`.

`retriever` is a **multi-subcommand Typer app**. Most of the old CLI examples
map to `retriever pipeline run INPUT_PATH`, which runs the graph pipeline
locally (in-process or via Ray) and writes results to LanceDB and, optionally,
to Parquet / object storage. Other subcommands cover focused tasks:

| Old intent | New subcommand |
|------------|----------------|
| Extract + embed + store a batch of documents | `retriever pipeline run` |
| Run an ad-hoc PDF extraction stage | `retriever pdf stage` |
| Run an HTML / text / audio / chart stage | `retriever html run`, `retriever txt run`, `retriever audio extract`, `retriever chart run` |
| Upload stage output to LanceDB | `retriever vector-store stage` |
| Query LanceDB + compute recall@k | `retriever recall vdb-recall` |
| Run a QA evaluation sweep | `retriever eval run` |
| Serve / submit to the online REST API | `retriever online serve` / `retriever online stream-pdf` |
| Benchmark stage throughput | `retriever benchmark {split,extract,audio-extract,page-elements,ocr,all}` |
| Benchmark orchestration | `retriever harness {run,sweep,nightly,summary,compare}` |

## Contents

| New file | Replaces example(s) in |
|----------|------------------------|
| [`retriever_cli.md`](retriever_cli.md) | `nv-ingest/docs/docs/extraction/nv-ingest_cli.md` and the rebranded mirror `cli-reference.md` |
| [`quickstart.md`](quickstart.md) | `nv-ingest/docs/docs/extraction/quickstart-guide.md` (the `nv-ingest-cli` section) |
| [`pdf-split-tuning.md`](pdf-split-tuning.md) | `nv-ingest/docs/docs/extraction/v2-api-guide.md` (CLI example) |
| [`smoke-test.md`](smoke-test.md) | `nv-ingest/api/api_tests/smoke_test.sh` |
| [`cli-client-usage.md`](cli-client-usage.md) | `nv-ingest/client/client_examples/examples/cli_client_usage.ipynb` |
| [`pdf-blueprint.md`](pdf-blueprint.md) | `nv-ingest/deploy/pdf-blueprint.ipynb` (CLI cell) |
| [`benchmarking.md`](benchmarking.md) | `nv-ingest/docs/docs/extraction/benchmarking.md` and `nv-ingest/tools/harness/README.md` |

## Gaps with no retriever-CLI equivalent (kept out of this folder)

The following `nv-ingest-cli` examples are **not** migrated here because the
new CLI does not yet expose an equivalent — continue to use `nv-ingest-cli`
for these cases:

- `--task 'udf:{…}'` — user-defined functions
(`nv-ingest/docs/docs/extraction/user-defined-functions.md`,
`nv-ingest/examples/udfs/README.md`). `retriever` does not expose UDFs.
- `--task 'filter:{content_type:"image", min_size:…, min_aspect_ratio:…, max_aspect_ratio:…}'`.
The image scale/aspect-ratio filter stage is not reproduced in the new CLI.
- Bare service submission (`nv-ingest-cli --doc foo.pdf` with no extract tasks
and full content-type metadata returned by the service). `retriever online submit`
is currently a stub — only `retriever online stream-pdf` is implemented.
- `gen_dataset.py` dataset creation with enumeration and sampling.
- `--collect_profiling_traces --zipkin_host --zipkin_port`. Zipkin trace
  export is not available; `--runtime-metrics-dir` / `--runtime-metrics-prefix`
  write runtime metrics files instead, which are a different kind of telemetry.
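
Until UDFs land in the new CLI, one workaround is to post-process the Parquet rows written by `--save-intermediate` in plain Python. A minimal sketch, assuming a simple per-row dict shape — `apply_udf` and `redact_emails` are hypothetical helpers for illustration, not part of either CLI:

```python
import re

def apply_udf(rows, udf):
    """Apply a user-defined function to each extraction row (a plain dict),
    roughly what --task 'udf:{...}' did on the service side."""
    return [udf(dict(row)) for row in rows]

def redact_emails(row):
    """Example UDF: mask email addresses in the extracted text."""
    row["text"] = re.sub(r"\S+@\S+", "[email]", row.get("text", ""))
    return row

rows = [{"source_id": "a.pdf", "text": "contact bob@example.com"}]
print(apply_udf(rows, redact_emails))
# → [{'source_id': 'a.pdf', 'text': 'contact [email]'}]
```

The same pattern works on a DataFrame loaded from the `--save-intermediate` directory if you prefer pandas.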

## Conventions used in the examples

- Input paths assume you invoke `retriever` from the `nv-ingest/nemo_retriever`
directory (or point at absolute paths).
- `--save-intermediate <dir>` writes the extraction DataFrame as Parquet for
inspection. LanceDB output goes to `--lancedb-uri` (defaults to `./lancedb`).
- `--store-images-uri <uri>` stores extracted images to a local path or an
fsspec URI (e.g. `s3://bucket/prefix`).
- `--run-mode inprocess` skips Ray and is ideal for single-file demos and CI;
`--run-mode batch` (the default) uses Ray Data for throughput.

Run `retriever pipeline run --help` for the authoritative flag list.
98 changes: 98 additions & 0 deletions nemo_retriever/docs/cli/benchmarking.md
# Benchmarking with the `retriever` CLI

This page is the `retriever`-CLI counterpart to
`nv-ingest/docs/docs/extraction/benchmarking.md` and
`nv-ingest/tools/harness/README.md`.

The old benchmarking workflow is driven by `tools/harness` and
`uv run nv-ingest-harness-run`. The `retriever` CLI exposes the harness (and
per-stage micro-benchmarks) as first-class subcommands, so you can run
benchmarks without `uv run` or a separate harness repo.

## Harness (end-to-end benchmarks)

Old:

```bash
cd tools/harness
uv sync
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data
```

New — the harness is a subcommand on the main CLI (full parity):

```bash
retriever harness run --case=e2e --dataset=bo767
retriever harness run --case=e2e --dataset=/path/to/your/data
```

Related commands (browse with `--help`):

```bash
retriever harness --help # run, sweep, nightly, summary, compare
retriever harness run --help
retriever harness sweep --help
retriever harness nightly --help
retriever harness summary --help
retriever harness compare --help
```

### Harness with image / text storage

The harness README already shows this command in `retriever` form, so it is
unchanged here:

```bash
retriever harness run --dataset bo20 --preset single_gpu \
    --override store_images_uri=stored_images --override store_text=true
```

When `store_images_uri` is a relative path it resolves to
`artifact_dir/stored_images/` per run; absolute paths and fsspec URIs
(e.g. `s3://bucket/prefix`) are passed through unchanged.
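
The resolution rule above can be sketched as follows — `resolve_store_uri` is a hypothetical illustration of the described behavior, not a real harness function:

```python
from pathlib import Path

def resolve_store_uri(uri: str, artifact_dir: str) -> str:
    """Relative paths nest under the per-run artifact_dir; absolute paths
    and fsspec URIs (scheme://...) pass through unchanged."""
    if "://" in uri or Path(uri).is_absolute():
        return uri
    return str(Path(artifact_dir) / uri)

print(resolve_store_uri("stored_images", "/runs/2024-01-01"))
# → /runs/2024-01-01/stored_images
print(resolve_store_uri("s3://bucket/prefix", "/runs/2024-01-01"))
# → s3://bucket/prefix
```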

## Per-stage micro-benchmarks

The new CLI also exposes stage-level throughput benchmarks that had no direct
counterpart in `nv-ingest-cli`:

```bash
retriever benchmark --help # split, extract, audio-extract, page-elements, ocr, all
retriever benchmark split --help
retriever benchmark extract --help
retriever benchmark audio-extract --help
retriever benchmark page-elements --help
retriever benchmark ocr --help
retriever benchmark all --help
```

Example — benchmark the PDF extraction actor:

```bash
retriever benchmark extract ./data/pdf_corpus \
--pdf-extract-batch-size 8 \
--pdf-extract-actors 4
```

Each benchmark reports rows/sec (or chunk rows/sec for audio) for its actor.
Use these when you want focused numbers for a single stage instead of an
end-to-end run.

## Parity notes

- The harness use-cases in the old docs (`--case=e2e`, `--dataset=bo767`,
`--dataset=/path/...`, `--override ...`) are preserved verbatim — only the
launcher changes (`retriever harness run …` instead of
`uv run nv-ingest-harness-run …`).
- If you have a repo-local `uv` environment, `uv run retriever harness run …`
still works.
- Stage benchmarks (`retriever benchmark …`) are net-new relative to the old
`nv-ingest-cli` examples — they are the recommended way to profile
individual actors before tuning `pipeline run` flags.
126 changes: 126 additions & 0 deletions nemo_retriever/docs/cli/cli-client-usage.md
# `retriever` CLI — Client-Usage Walk-through

This page is the `retriever`-CLI counterpart to
`nv-ingest/client/client_examples/examples/cli_client_usage.ipynb`.

The original notebook walks through `nv-ingest-cli` by:

1. Printing `--help`.
2. Submitting a single PDF with `extract + dedup + filter` tasks.
3. Submitting a dataset of PDFs with the same task set.

The equivalent `retriever` workflow is shown below. You can drop these cells
into a new notebook (e.g. `retriever_client_usage.ipynb`) alongside the old
one.

## 1. Help

```bash
retriever --help
retriever pipeline run --help
```

Top-level `--help` lists the subcommand tree; `pipeline run --help` shows the
ingest-specific flags you will actually use in this walk-through.

## 2. Submit a single PDF

Old notebook cell:

```bash
nv-ingest-cli \
--doc ${SAMPLE_PDF0} \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
--task='dedup:{"content_type": "image", "filter": true}' \
--task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
--client_host=${REDIS_HOST} \
--client_port=${REDIS_PORT} \
--output_directory=${OUTPUT_DIRECTORY_SINGLE}
```

New:

```bash
retriever pipeline run "${SAMPLE_PDF0}" \
--input-type pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--dedup --dedup-iou-thres 0.45 \
--store-images-uri "${OUTPUT_DIRECTORY_SINGLE}/images" \
--strip-base64 \
--save-intermediate "${OUTPUT_DIRECTORY_SINGLE}"
```

### Parity notes

- `extract_tables_method:"yolox"` is not a CLI selector — the pipeline picks
its table/structure detectors automatically. Tables are still extracted.
- `dedup:{content_type:"image", filter:true}` maps to `--dedup` (with
`--dedup-iou-thres` for the IoU threshold).
- `filter:{content_type:"image", min_size, min/max_aspect_ratio, filter:true}`
**has no parity.** There is no image scale/aspect-ratio filter in the
`retriever` CLI today. If that matters, drop to the Python API or keep the
old `nv-ingest-cli` for that example.
- `extract_images:true` is implicitly satisfied by `--store-images-uri`
(images are extracted and persisted to the URI).
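
If the dropped filter stage matters, its rules are simple enough to reproduce over the saved rows yourself. A sketch of the old task's semantics as a plain predicate — `keep_image` is hypothetical, and the exact server-side behavior may differ:

```python
def keep_image(width: int, height: int, *,
               min_size: int = 128,
               min_aspect_ratio: float = 0.2,
               max_aspect_ratio: float = 5.0) -> bool:
    """Approximation of filter:{content_type:"image", ...}: drop images whose
    shorter side is below min_size or whose width/height ratio is extreme."""
    if min(width, height) < min_size:
        return False
    aspect = width / height
    return min_aspect_ratio <= aspect <= max_aspect_ratio

print(keep_image(640, 480))    # True
print(keep_image(64, 64))      # False — shorter side below min_size
print(keep_image(2000, 300))   # False — aspect ratio 6.67 exceeds 5.0
```

Apply it while iterating the Parquet rows (using whatever width/height metadata your rows carry) before uploading to LanceDB.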

## 3. Submit a dataset of PDFs

Old notebook cell:

```bash
nv-ingest-cli \
--dataset ${BATCH_FILE} \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
--task='dedup:{"content_type": "image", "filter": true}' \
--task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
--client_host=${REDIS_HOST} \
--client_port=${REDIS_PORT} \
--output_directory=${OUTPUT_DIRECTORY_BATCH}
```

New — point `retriever` at a directory of PDFs instead of a dataset JSON:

```bash
# Assume $PDF_DIR is a directory holding your batch of PDFs.
retriever pipeline run "${PDF_DIR}" \
--input-type pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--dedup --dedup-iou-thres 0.45 \
--store-images-uri "${OUTPUT_DIRECTORY_BATCH}/images" \
--strip-base64 \
--save-intermediate "${OUTPUT_DIRECTORY_BATCH}"
```

### Parity notes

- The `dataset.json` (`sampled_files`) format and `gen_dataset.py` sampler
are not reproduced. Materialize a directory (or glob) containing the files
you want to process.
- The `--shuffle_dataset` knob is not present; set Ray block / batch sizes
via `--pdf-split-batch`, `--pdf-split-batch-size`, etc. for throughput.
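
If you already have a `dataset.json`, you can materialize its file list into a directory yourself and point `retriever pipeline run` at that. A sketch assuming the old `sampled_files` key — `materialize_dataset` is a hypothetical helper; adjust to your actual dataset schema:

```python
import json
import shutil
from pathlib import Path

def materialize_dataset(dataset_json: str, out_dir: str) -> list[str]:
    """Copy every file listed under "sampled_files" into out_dir so the
    directory can be passed directly to `retriever pipeline run`."""
    spec = json.loads(Path(dataset_json).read_text())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in spec["sampled_files"]:
        dst = out / Path(src).name
        shutil.copy2(src, dst)
        copied.append(str(dst))
    return copied
```

Symlinking instead of copying also works if the dataset is large and stays on the same filesystem.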

## 4. Inspect results

```python
import os

import lancedb
import pyarrow.parquet as pq

# Parquet extraction dumps written by --save-intermediate
# (OUTPUT_DIRECTORY_BATCH was exported as a shell variable above):
out_dir = os.environ["OUTPUT_DIRECTORY_BATCH"]
df = pq.read_table(out_dir).to_pandas()
print(df[["source_id", "text", "content_type"]].head())

# LanceDB rows (default table name "nv-ingest"):
db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```

## Migration summary

| Old notebook cell | New `retriever` form | Parity |
|-------------------|----------------------|--------|
| `!nv-ingest-cli --help` | `!retriever --help` (plus `retriever pipeline run --help`) | Full |
| Single-file extract + dedup + filter | `retriever pipeline run <file> … --dedup …` | Partial — no image-size/aspect filter, `extract_tables_method` auto-selected |
| Dataset extract + dedup + filter | `retriever pipeline run <dir> …` | Partial — no `dataset.json` loader; use a directory |
92 changes: 92 additions & 0 deletions nemo_retriever/docs/cli/pdf-blueprint.md
# PDF Blueprint — `retriever` CLI Replacement

This page is the `retriever`-CLI counterpart to the CLI cell in
`nv-ingest/deploy/pdf-blueprint.ipynb`.

## Original blueprint cell

```bash
nv-ingest-cli \
--doc nv-ingest/data/multimodal_test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true", "extract_charts": "true"}' \
--client_host=host.docker.internal \
--client_port=7670
```

This submits the blueprint's multimodal sample PDF to the running ingest
service and asks for text + tables + charts + images.

## `retriever` equivalent

```bash
retriever pipeline run nv-ingest/data/multimodal_test.pdf \
--input-type pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--store-images-uri ./processed_docs/images \
--strip-base64 \
--save-intermediate ./processed_docs
```

### What you get (end-user outcome)

- The same multimodal content (text, table markdown, chart descriptions,
extracted images) is produced.
- Text / table / chart rows land in LanceDB at `./lancedb/nv-ingest.lance`.
- Parquet extraction rows are written under `./processed_docs/`.
- Extracted images are written under `./processed_docs/images/`, referenced by
`content_url` in the row metadata.

### Notebook-friendly form

To keep the notebook self-contained, prefix the shell cell with `!`:

```bash
!retriever pipeline run nv-ingest/data/multimodal_test.pdf \
--input-type pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--store-images-uri ./processed_docs/images \
--strip-base64 \
--save-intermediate ./processed_docs
```

And inspect the results in the next cell:

```python
import pyarrow.parquet as pq
import lancedb

df = pq.read_table("./processed_docs").to_pandas()
print(df[["source_id", "content_type"]].value_counts())

db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())
```

## Migrating the blueprint `pip install` cell

The blueprint also installs `nv-ingest-client==25.9.0`. For the `retriever`
path, install `nemo-retriever` instead (see `nemo_retriever/README.md` for
current pinned versions):

```bash
pip install "nemo-retriever==26.3.0" \
nv-ingest-client==26.3.0 nv-ingest==26.3.0 nv-ingest-api==26.3.0 \
pymilvus[bulk_writer,model] \
minio \
tritonclient \
langchain_milvus
```

## Parity notes

- `client_host=host.docker.internal` / `client_port=7670` are irrelevant here:
`retriever pipeline run` is in-process, so the blueprint no longer needs a
running `nv-ingest-ms-runtime` container for the CLI cell.
- If you still want the blueprint to hit a live service (for example to
exercise the REST API), replace the CLI cell with a `retriever online serve`
container plus `retriever online stream-pdf` for per-page NDJSON output.
Note that `retriever online submit` is currently a stub.