Skip to content

test: cover remote pairwise file paths#2061

Open
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1213
Open

test: cover remote pairwise file paths#2061
nightcityblade wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1213

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Adds regression coverage for remote semantic dedup pairwise input listings where fsspec returns paths without the storage protocol. The test verifies that generated FileGroupTask parquet paths retain the full remote URL.

Closes #1213.

Usage

N/A — test-only regression coverage.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Validation:

  • uv run ruff check tests/stages/deduplication/semantic/test_pairwise_io.py
  • uv run ruff format --check tests/stages/deduplication/semantic/test_pairwise_io.py
  • git diff --check

Note: I attempted to run uv run pytest tests/stages/deduplication/semantic/test_pairwise_io.py -q, but local execution is blocked because NeMo Curator raises on unsupported Darwin/macOS hosts before tests load.

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a regression test for ClusterWiseFilePartitioningStage to verify that FileGroupTask parquet paths retain full remote URLs (gs://…) even when fsspec's ls/find return paths stripped of the storage protocol.

  • A new FakeRemoteFs inner class stubs ls, expand_path, isdir, and find to simulate a remote GCS filesystem that returns protocol-stripped paths, then asserts that the stage's output carries fully-qualified gs:// URLs.
  • The FakeRemoteFs.find method asserts maxdepth == 1, but process() calls get_all_file_paths_under(..., recurse_subdirectories=True, ...), which causes _gather_file_records to forward maxdepth=None — making this assertion always wrong and causing the test to fail on execution.

Confidence Score: 4/5

The test has a wrong maxdepth assertion that will fail on any real test run, making the regression coverage ineffective until fixed.

The new test correctly models the protocol-stripping scenario and the overall structure is sound, but assert maxdepth == 1 inside FakeRemoteFs.find contradicts what the production code actually passes (None, because recurse_subdirectories=True). The test cannot pass in its current state.

tests/stages/deduplication/semantic/test_pairwise_io.py — specifically the maxdepth assertion in FakeRemoteFs.find

Important Files Changed

Filename Overview
tests/stages/deduplication/semantic/test_pairwise_io.py Adds a remote-path regression test for ClusterWiseFilePartitioningStage, but contains a wrong maxdepth == 1 assertion — production code passes maxdepth=None (recursive mode), so the test will always fail when executed.

Sequence Diagram

sequenceDiagram
    participant T as Test
    participant S as ClusterWiseFilePartitioningStage.process()
    participant U as get_all_file_paths_under()
    participant G as _gather_file_records()
    participant F as FakeRemoteFs

    T->>S: process(empty_task)
    S->>F: ls("gs://bucket/kmeans")
    F-->>S: "["bucket/kmeans/centroid=7"]"
    S->>F: "unstrip_protocol("bucket/kmeans/centroid=7")"
    F-->>S: "gs://bucket/kmeans/centroid=7"
    S->>U: "get_all_file_paths_under("gs://bucket/kmeans/centroid=7", recurse_subdirectories=True, ...)"
    U->>G: "_gather_file_records(..., recurse_subdirectories=True, ...)"
    G->>F: "expand_path("gs://bucket/kmeans/centroid=7", recursive=False)"
    F-->>G: "["gs://bucket/kmeans/centroid=7"]"
    G->>F: "isdir("gs://bucket/kmeans/centroid=7")"
    F-->>G: True
    G->>F: "find("gs://bucket/kmeans/centroid=7", maxdepth=None, withdirs=False, detail=False)"
    Note over F: assert maxdepth == 1 FAILS (actual is None)
    F-->>G: "["bucket/kmeans/centroid=7/part.0.parquet"]"
    G->>F: "unstrip_protocol("bucket/kmeans/centroid=7/part.0.parquet")"
    F-->>G: "gs://bucket/kmeans/centroid=7/part.0.parquet"
    G-->>U: "[("gs://bucket/kmeans/centroid=7/part.0.parquet", None)]"
    U-->>S: "["gs://bucket/kmeans/centroid=7/part.0.parquet"]"
    S-->>T: "[FileGroupTask(data=["gs://bucket/kmeans/centroid=7/part.0.parquet"])]"
Loading

Reviews (2): Last reviewed commit: "test: align fake remote fs kwargs" | Re-trigger Greptile

Comment on lines +124 to +131
def expand_path(self, path: str, _recursive: bool = False) -> list[str]:
return [path]

def isdir(self, _path: str) -> bool:
return True

def find(self, _path: str, _maxdepth: int | None, _withdirs: bool, _detail: bool) -> list[str]:
return ["bucket/kmeans/centroid=7/part.0.parquet"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Underscore-prefixed parameter names break keyword-argument dispatch

_gather_file_records calls fs.expand_path(path, recursive=False) and fs.find(root, maxdepth=..., withdirs=False, detail=...) using keyword arguments. Python matches kwargs by exact parameter name, so recursive=False never reaches _recursive, and maxdepth=/withdirs=/detail= never reach _maxdepth/_withdirs/_detail — Python will raise TypeError: expand_path() got an unexpected keyword argument 'recursive' before any assertion is checked. The _ prefix convention marks a parameter as "intentionally unused" but does not make it absorb a kwarg with the unprefixed name. All four parameter names need the underscore dropped (or the methods need **kwargs) to let the keyword calls land.

@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Thanks for the catch. I pushed 4cbaa5f to align the FakeRemoteFs helper with the keyword arguments used by _gather_file_records:

  • expand_path(..., recursive=False) now accepts the real recursive= kwarg and asserts the expected value.
  • find(..., maxdepth=..., withdirs=..., detail=...) now accepts the real kwargs and asserts the expected call shape.

Validation run locally:

  • ruff check tests/stages/deduplication/semantic/test_pairwise_io.py
  • ruff format --check tests/stages/deduplication/semantic/test_pairwise_io.py
  • python3 -m py_compile tests/stages/deduplication/semantic/test_pairwise_io.py
  • Targeted pytest attempted, but this local checkout is missing project runtime deps (ray via tests/conftest.py, and cosmos_xenna when bypassing conftest), so the test could not execute in this environment.

Comment on lines +131 to +136
def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]:
assert path == "gs://bucket/kmeans/centroid=7"
assert maxdepth == 1
assert withdirs is False
assert detail is False
return ["bucket/kmeans/centroid=7/part.0.parquet"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 maxdepth == 1 assertion is wrong and will cause the test to fail. process() calls get_all_file_paths_under(..., recurse_subdirectories=True, ...), and inside _gather_file_records the depth is computed as maxdepth=None if recurse_subdirectories else 1 — so with recurse_subdirectories=True the value passed to find is None, not 1.

Suggested change
def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]:
assert path == "gs://bucket/kmeans/centroid=7"
assert maxdepth == 1
assert withdirs is False
assert detail is False
return ["bucket/kmeans/centroid=7/part.0.parquet"]
def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]:
assert path == "gs://bucket/kmeans/centroid=7"
assert maxdepth is None
assert withdirs is False
assert detail is False
return ["bucket/kmeans/centroid=7/part.0.parquet"]

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Semantic Dedup Pairwise IO fails to list files for remote cloud path

2 participants