Skip to content

fix: open remote JSONL files before cuDF reads#2076

Open
nightcityblade wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1569
Open

fix: open remote JSONL files before cuDF reads#2076
nightcityblade wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1569

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Remote JSONL exact-dedup reads now open remote paths with fsspec before passing them to cudf.read_json. This avoids handing raw cloud URLs to cuDF for file lists, which prevents cuDF's directory-expansion path from issuing an extra isdir/metadata request per file before reading.

Closes #1569.

Usage

# Existing exact deduplication JSONL reads continue to use read_jsonl(...)
# Remote file paths are opened as fsspec file buffers before cudf.read_json.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing

  • uv run ruff check nemo_curator/stages/deduplication/io_utils.py
  • uv run python -m py_compile nemo_curator/stages/deduplication/io_utils.py

Note: I could not run the Python test suite on this macOS machine because importing nemo_curator exits on non-Linux platforms.

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes exact-dedup JSONL reading for remote paths by pre-opening cloud files via fsspec and passing the resulting file buffers to cudf.read_json, rather than handing raw URLs directly to cuDF. This prevents cuDF's internal directory-expansion logic from issuing redundant isdir/metadata requests per file.

  • Adds _open_remote_files, a @contextmanager that opens each remote path with the appropriate fsspec filesystem, yields the buffer(s) and a cleaned kwargs dict (with storage_options stripped, since cuDF no longer needs it when given open buffers), and closes all handles on exit via ExitStack.
  • Falls through to the original code path unchanged for local paths and mixed local/remote lists.

Confidence Score: 4/5

Safe to merge — the change is narrow, correctly handles both single-path and multi-path inputs, properly closes file handles via ExitStack, and strips storage_options only from the cuDF call where they are no longer needed.

The logic is correct for both the single-string and list-of-strings cases, and cudf.read_json accepts a list of file-like objects. The one minor point is that a new filesystem object is instantiated for every file in a list rather than being reused across files that share the same backend, but this does not affect correctness. No functional test coverage was added for the remote-file code path.

nemo_curator/stages/deduplication/io_utils.py — specifically the _open_remote_files loop and its interaction with cudf for multi-file remote reads, which lacks automated test coverage.

Important Files Changed

Filename Overview
nemo_curator/stages/deduplication/io_utils.py Adds _open_remote_files context manager that pre-opens remote JSONL paths via fsspec before handing buffers to cudf.read_json, correctly handling both single-string and list-of-strings inputs and stripping storage_options from kwargs passed to cudf. Minor: a new filesystem object is created per file in the loop.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant read_jsonl
    participant _open_remote_files
    participant is_remote_url
    participant get_fs
    participant fsspec as fsspec (fs)
    participant cudf

    Caller->>read_jsonl: "read_jsonl(filepath, **kwargs)"
    read_jsonl->>_open_remote_files: enter context (filepath, kwargs)
    _open_remote_files->>is_remote_url: is_remote_url(path) for each path
    is_remote_url-->>_open_remote_files: True/False

    alt all paths are remote
        _open_remote_files->>get_fs: get_fs(path, storage_options) per file
        get_fs-->>_open_remote_files: fs instance
        _open_remote_files->>fsspec: fs.open(path, rb) per file
        fsspec-->>_open_remote_files: file buffer (entered into ExitStack)
        _open_remote_files-->>read_jsonl: yield (buffer(s), read_kwargs without storage_options)
        read_jsonl->>cudf: "cudf.read_json(buffer(s), lines=True, **read_kwargs)"
        cudf-->>read_jsonl: DataFrame
        read_jsonl->>_open_remote_files: exit context
        _open_remote_files->>fsspec: ExitStack closes all buffers
    else local or mixed paths
        _open_remote_files-->>read_jsonl: yield (original filepath, original kwargs)
        read_jsonl->>cudf: "cudf.read_json(filepath, lines=True, **kwargs)"
        cudf-->>read_jsonl: DataFrame
    end

    read_jsonl-->>Caller: DataFrame (with optional column filter / id assignment)
Loading

Reviews (1): Last reviewed commit: "fix: open remote jsonl files before cudf..." | Re-trigger Greptile

Comment on lines +60 to +62
for path in paths:
fs = get_fs(path, storage_options=storage_options)
buffers.append(stack.enter_context(fs.open(path, mode="rb")))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Per-file filesystem instantiation

A new filesystem object is created for every path in the loop via get_fs. For a list of N files that all share the same protocol and storage_options, this produces N separate client instances — each may negotiate its own connection or credential chain. A single fs instance could be created once (keyed on protocol) and reused across all files from the same backend.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Better handling of 429 errors when reading from cloud paths with cuDF

2 participants