test: cover remote pairwise file paths#2061
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR adds a regression test for
Confidence Score: 4/5The test has a wrong The new test correctly models the protocol-stripping scenario and the overall structure is sound, but tests/stages/deduplication/semantic/test_pairwise_io.py — specifically the Important Files Changed
Sequence DiagramsequenceDiagram
participant T as Test
participant S as ClusterWiseFilePartitioningStage.process()
participant U as get_all_file_paths_under()
participant G as _gather_file_records()
participant F as FakeRemoteFs
T->>S: process(empty_task)
S->>F: ls("gs://bucket/kmeans")
F-->>S: "["bucket/kmeans/centroid=7"]"
S->>F: "unstrip_protocol("bucket/kmeans/centroid=7")"
F-->>S: "gs://bucket/kmeans/centroid=7"
S->>U: "get_all_file_paths_under("gs://bucket/kmeans/centroid=7", recurse_subdirectories=True, ...)"
U->>G: "_gather_file_records(..., recurse_subdirectories=True, ...)"
G->>F: "expand_path("gs://bucket/kmeans/centroid=7", recursive=False)"
F-->>G: "["gs://bucket/kmeans/centroid=7"]"
G->>F: "isdir("gs://bucket/kmeans/centroid=7")"
F-->>G: True
G->>F: "find("gs://bucket/kmeans/centroid=7", maxdepth=None, withdirs=False, detail=False)"
Note over F: assert maxdepth == 1 FAILS (actual is None)
F-->>G: "["bucket/kmeans/centroid=7/part.0.parquet"]"
G->>F: "unstrip_protocol("bucket/kmeans/centroid=7/part.0.parquet")"
F-->>G: "gs://bucket/kmeans/centroid=7/part.0.parquet"
G-->>U: "[("gs://bucket/kmeans/centroid=7/part.0.parquet", None)]"
U-->>S: "["gs://bucket/kmeans/centroid=7/part.0.parquet"]"
S-->>T: "[FileGroupTask(data=["gs://bucket/kmeans/centroid=7/part.0.parquet"])]"
Reviews (2): Last reviewed commit: "test: align fake remote fs kwargs" | Re-trigger Greptile |
| def expand_path(self, path: str, _recursive: bool = False) -> list[str]: | ||
| return [path] | ||
|
|
||
| def isdir(self, _path: str) -> bool: | ||
| return True | ||
|
|
||
| def find(self, _path: str, _maxdepth: int | None, _withdirs: bool, _detail: bool) -> list[str]: | ||
| return ["bucket/kmeans/centroid=7/part.0.parquet"] |
There was a problem hiding this comment.
Underscore-prefixed parameter names break keyword-argument dispatch
_gather_file_records calls fs.expand_path(path, recursive=False) and fs.find(root, maxdepth=..., withdirs=False, detail=...) using keyword arguments. Python matches kwargs by exact parameter name, so recursive=False never reaches _recursive, and maxdepth=/withdirs=/detail= never reach _maxdepth/_withdirs/_detail — Python will raise TypeError: expand_path() got an unexpected keyword argument 'recursive' before any assertion is checked. The _ prefix convention marks a parameter as "intentionally unused" but does not make it absorb a kwarg with the unprefixed name. All four parameter names need the underscore dropped (or the methods need **kwargs) to let the keyword calls land.
|
Thanks for the catch. I pushed 4cbaa5f to align the
Validation run locally:
|
| def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]: | ||
| assert path == "gs://bucket/kmeans/centroid=7" | ||
| assert maxdepth == 1 | ||
| assert withdirs is False | ||
| assert detail is False | ||
| return ["bucket/kmeans/centroid=7/part.0.parquet"] |
There was a problem hiding this comment.
maxdepth == 1 assertion is wrong and will cause the test to fail. process() calls get_all_file_paths_under(..., recurse_subdirectories=True, ...), and inside _gather_file_records the depth is computed as maxdepth=None if recurse_subdirectories else 1 — so with recurse_subdirectories=True the value passed to find is None, not 1.
| def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]: | |
| assert path == "gs://bucket/kmeans/centroid=7" | |
| assert maxdepth == 1 | |
| assert withdirs is False | |
| assert detail is False | |
| return ["bucket/kmeans/centroid=7/part.0.parquet"] | |
| def find(self, path: str, maxdepth: int | None, withdirs: bool, detail: bool) -> list[str]: | |
| assert path == "gs://bucket/kmeans/centroid=7" | |
| assert maxdepth is None | |
| assert withdirs is False | |
| assert detail is False | |
| return ["bucket/kmeans/centroid=7/part.0.parquet"] |
Description
Adds regression coverage for remote semantic dedup pairwise input listings where fsspec returns paths without the storage protocol. The test verifies that generated
FileGroupTaskparquet paths retain the full remote URL.Closes #1213.
Usage
N/A — test-only regression coverage.
Checklist
Validation:
uv run ruff check tests/stages/deduplication/semantic/test_pairwise_io.pyuv run ruff format --check tests/stages/deduplication/semantic/test_pairwise_io.pygit diff --checkNote: I attempted to run
uv run pytest tests/stages/deduplication/semantic/test_pairwise_io.py -q, but local execution is blocked because NeMo Curator raises on unsupported Darwin/macOS hosts before tests load.