fix: open remote JSONL files before cuDF reads#2076
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR fixes exact-dedup JSONL reading for remote paths by pre-opening cloud files via fsspec and passing the resulting file buffers to
Confidence Score: 4/5Safe to merge — the change is narrow, correctly handles both single-path and multi-path inputs, properly closes file handles via ExitStack, and strips storage_options only from the cuDF call where they are no longer needed. The logic is correct for both the single-string and list-of-strings cases, and cudf.read_json accepts a list of file-like objects. The one minor point is that a new filesystem object is instantiated for every file in a list rather than being reused across files that share the same backend, but this does not affect correctness. No functional test coverage was added for the remote-file code path. nemo_curator/stages/deduplication/io_utils.py — specifically the Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant read_jsonl
participant _open_remote_files
participant is_remote_url
participant get_fs
participant fsspec as fsspec (fs)
participant cudf
Caller->>read_jsonl: "read_jsonl(filepath, **kwargs)"
read_jsonl->>_open_remote_files: enter context (filepath, kwargs)
_open_remote_files->>is_remote_url: is_remote_url(path) for each path
is_remote_url-->>_open_remote_files: True/False
alt all paths are remote
_open_remote_files->>get_fs: get_fs(path, storage_options) per file
get_fs-->>_open_remote_files: fs instance
_open_remote_files->>fsspec: fs.open(path, rb) per file
fsspec-->>_open_remote_files: file buffer (entered into ExitStack)
_open_remote_files-->>read_jsonl: yield (buffer(s), read_kwargs without storage_options)
read_jsonl->>cudf: "cudf.read_json(buffer(s), lines=True, **read_kwargs)"
cudf-->>read_jsonl: DataFrame
read_jsonl->>_open_remote_files: exit context
_open_remote_files->>fsspec: ExitStack closes all buffers
else local or mixed paths
_open_remote_files-->>read_jsonl: yield (original filepath, original kwargs)
read_jsonl->>cudf: "cudf.read_json(filepath, lines=True, **kwargs)"
cudf-->>read_jsonl: DataFrame
end
read_jsonl-->>Caller: DataFrame (with optional column filter / id assignment)
Reviews (1): Last reviewed commit: "fix: open remote jsonl files before cudf..." | Re-trigger Greptile |
| for path in paths: | ||
| fs = get_fs(path, storage_options=storage_options) | ||
| buffers.append(stack.enter_context(fs.open(path, mode="rb"))) |
There was a problem hiding this comment.
Per-file filesystem instantiation
A new filesystem object is created for every path in the loop via get_fs. For a list of N files that all share the same protocol and storage_options, this produces N separate client instances — each may negotiate its own connection or credential chain. A single fs instance could be created once (keyed on protocol) and reused across all files from the same backend.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Description
Remote JSONL exact-dedup reads now open remote paths with fsspec before passing them to
cudf.read_json. This avoids handing raw cloud URLs to cuDF for file lists, which prevents cuDF's directory-expansion path from issuing an extraisdir/metadata request per file before reading.Closes #1569.
Usage
Checklist
Testing
uv run ruff check nemo_curator/stages/deduplication/io_utils.pyuv run python -m py_compile nemo_curator/stages/deduplication/io_utils.pyNote: I could not run the Python test suite on this macOS machine because importing
nemo_curatorexits on non-Linux platforms.