fix: default workflow input extensions by filetype#2045
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
|
Follow-up from #1982 review:
Validation:
|
Greptile SummaryThis PR fixes #1330 by making all deduplication workflows default
Confidence Score: 5/5Safe to merge — the change is narrowly scoped to defaulting file-extension lists and is well covered by new unit tests across all five affected sites. All modified call sites follow a consistent, straightforward pattern. The new helper is small and its error path is exercised by tests. The only note is that nemo_curator/utils/file_utils.py — the new helper returns a shared list reference; a copy would be safer. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Workflow / Stage constructor] --> B{input_file_extensions\nprovided?}
B -- Yes --> C[Use provided extensions]
B -- No --> D[get_default_file_extensions\ninput_filetype]
D --> E{filetype in\nFILETYPE_TO_DEFAULT_EXTENSIONS?}
E -- Yes --> F[Return default list\ne.g. jsonl → .jsonl .json]
E -- No --> G[Raise ValueError\nUnsupported filetype]
C --> H[Pass to FilePartitioningStage\nor Reader stage]
F --> H
Reviews (2): Last reviewed commit: "test: cover kmeans input file extension ..." | Re-trigger Greptile |
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | ||
| if not file_extensions: | ||
| msg = f"Unsupported filetype: {self.input_filetype}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
The
if not file_extensions: guard is now unreachable. get_default_file_extensions either returns a non-empty list or raises ValueError, so file_extensions will never be falsy after the assignment. The block can be removed.
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | |
| if not file_extensions: | |
| msg = f"Unsupported filetype: {self.input_filetype}" | |
| raise ValueError(msg) | |
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) |
There was a problem hiding this comment.
Addressed in 19a7290: removed the unreachable guard and now rely on get_default_file_extensions to raise for unsupported input_filetype.
sarahyurick
left a comment
There was a problem hiding this comment.
Shouldn't https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/deduplication/semantic/workflow.py#L268 be updated too or no?
| file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype) | ||
| if not file_extensions: | ||
| msg = f"Unsupported filetype: {self.input_filetype}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
Is there a test for this?
There was a problem hiding this comment.
Added focused KMeansStage decomposition tests in tests/stages/deduplication/semantic/test_kmeans.py covering default extensions for parquet/jsonl, custom extension overrides, and unsupported filetype errors.
|
Thanks for the follow-up. I pushed 19a7290 with the requested updates:
Validation:
Targeted pytest is still blocked on this macOS runner by NeMo-Curator’s Linux-only import guard. |
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks @nightcityblade ! LGTM, although I will default to @praateekmahajan before merging, since he was the one who created the issue.
Description
Fixes #1330. Replacement for #1982 after the source branch had to be recovered.
Workflows that expose
input_file_extensionsnow default to the extensions for their selectedinput_filetypewhen no explicit override is provided, instead of falling back to the genericFilePartitioningStagedefaults.Usage
Checklist
Tests:
uv run ruff check tests/stages/text/deduplication/test_semantic.pyuv run ruff format --check tests/stages/text/deduplication/test_semantic.pypython3 -m py_compile tests/stages/text/deduplication/test_semantic.pyuv run pytest tests/stages/text/deduplication/test_semantic.py -k 'embedding_reader_extensions'(blocked locally: NeMo-Curator raises at import time on non-Linux/darwin)