Skip to content

fix: default workflow input extensions by filetype#2045

Open
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330
Open

fix: default workflow input extensions by filetype#2045
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1330

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Fixes #1330. Replacement for #1982 after the source branch had to be recovered.

Workflows that expose input_file_extensions now default to the extensions for their selected input_filetype when no explicit override is provided, instead of falling back to the generic FilePartitioningStage defaults.

Usage

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

workflow = ExactDeduplicationWorkflow(
    input_path="/data/input",
    output_path="/data/output",
    input_filetype="jsonl",
)
# Defaults to [".jsonl", ".json"] unless input_file_extensions is set.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Tests:

  • uv run ruff check tests/stages/text/deduplication/test_semantic.py
  • uv run ruff format --check tests/stages/text/deduplication/test_semantic.py
  • python3 -m py_compile tests/stages/text/deduplication/test_semantic.py
  • uv run pytest tests/stages/text/deduplication/test_semantic.py -k 'embedding_reader_extensions' (blocked locally: NeMo-Curator raises at import time on non-Linux/darwin)

nightcityblade added 2 commits May 14, 2026 23:08
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Follow-up from #1982 review:

  • Moved the semantic file-extension tests into the existing tests/stages/text/deduplication/test_semantic.py file instead of creating a new test file.
  • Kept nemo_curator/stages/deduplication/semantic/workflow.py unchanged because it delegates input partitioning to KMeansStage, and this PR passes input_file_extensions through to that stage; the default resolution happens in KMeansStage.decompose() before constructing FilePartitioningStage.

Validation:

  • uv run ruff check tests/stages/text/deduplication/test_semantic.py
  • uv run ruff format --check tests/stages/text/deduplication/test_semantic.py
  • python3 -m py_compile tests/stages/text/deduplication/test_semantic.py
  • Targeted pytest remains blocked on this Darwin runner by NeMo-Curator’s Linux-only import guard.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes #1330 by making all deduplication workflows default input_file_extensions to the appropriate extensions for the selected input_filetype (e.g. [".jsonl", ".json"] for "jsonl") instead of falling back to the generic FilePartitioningStage defaults that included all known extensions.

  • Extracts a new get_default_file_extensions helper in file_utils.py and threads it through five workflow/stage sites (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, KMeansStage, TextDuplicatesRemovalWorkflow, TextSemanticDeduplicationWorkflow) using the idiomatic self.input_file_extensions or get_default_file_extensions(self.input_filetype) pattern.
  • Removes the now-unreachable if not file_extensions guard in KMeansStage.decompose() that was previously flagged in review.
  • Adds comprehensive unit tests across all affected workflows, including parametrised default-extension checks, explicit-override checks, and unsupported-filetype error assertions; also removes autouse=True from ensure_semantic_model_downloaded to avoid spurious model downloads when running non-GPU tests.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to defaulting file-extension lists and is well covered by new unit tests across all five affected sites.

All modified call sites follow a consistent, straightforward pattern. The new helper is small and its error path is exercised by tests. The only note is that get_default_file_extensions returns a direct reference to the mutable list inside the global dict rather than a copy, but no current caller mutates that list. Logic is correct and the test suite covers defaults, overrides, and invalid filetypes.

nemo_curator/utils/file_utils.py — the new helper returns a shared list reference; a copy would be safer.

Important Files Changed

Filename Overview
nemo_curator/utils/file_utils.py Adds get_default_file_extensions helper that centralises filetype-to-extension mapping; returns a direct reference to the mutable list in the global dict rather than a copy.
nemo_curator/stages/deduplication/exact/workflow.py Uses get_default_file_extensions as fallback when input_file_extensions is not set; clean one-line change.
nemo_curator/stages/deduplication/fuzzy/workflow.py Same pattern as exact workflow — delegates to get_default_file_extensions when no explicit extensions are given.
nemo_curator/stages/deduplication/semantic/kmeans.py Replaces inline dict lookup + unreachable guard with get_default_file_extensions; removes dead code correctly.
nemo_curator/stages/text/deduplication/removal_workflow.py Applies the same input_file_extensions or get_default_file_extensions(...) pattern to the removal workflow's FilePartitioningStage.
nemo_curator/stages/text/deduplication/semantic.py Applies the fix to both the jsonl and parquet reader branches inside _run_embedding_generation; the existing else: raise NotImplementedError still fires before get_default_file_extensions for genuinely unsupported types.
tests/stages/text/deduplication/test_semantic.py Adds three monkeypatched unit tests covering default extension selection, explicit override, and unsupported-filetype error; also removes autouse=True from ensure_semantic_model_downloaded to avoid downloading the model for non-GPU tests, and wires it explicitly into the integration-test fixture.
tests/stages/deduplication/exact/test_workflow.py Adds tests for default extension behaviour and explicit override in ExactDeduplicationWorkflow.
tests/stages/deduplication/fuzzy/test_fuzzy_workflow.py Mirrors the exact-dedup tests for FuzzyDeduplicationWorkflow.
tests/stages/deduplication/semantic/test_kmeans.py New TestKMeansStage class covers default extensions for parquet/jsonl, override, and the ValueError path for unsupported filetypes.
tests/stages/text/deduplication/test_removal_workflow.py Extends the parametrised test_reader_stage to assert filetype-specific extensions and adds a separate custom-override test.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow / Stage constructor] --> B{input_file_extensions\nprovided?}
    B -- Yes --> C[Use provided extensions]
    B -- No --> D[get_default_file_extensions\ninput_filetype]
    D --> E{filetype in\nFILETYPE_TO_DEFAULT_EXTENSIONS?}
    E -- Yes --> F[Return default list\ne.g. jsonl → .jsonl .json]
    E -- No --> G[Raise ValueError\nUnsupported filetype]
    C --> H[Pass to FilePartitioningStage\nor Reader stage]
    F --> H
Loading

Reviews (2): Last reviewed commit: "test: cover kmeans input file extension ..." | Re-trigger Greptile

Comment on lines 347 to 350
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The if not file_extensions: guard is now unreachable. get_default_file_extensions either returns a non-empty list or raises ValueError, so file_extensions will never be falsy after the assignment. The block can be removed.

Suggested change
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 19a7290: removed the unreachable guard and now rely on get_default_file_extensions to raise for unsupported input_filetype.

@sarahyurick sarahyurick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 347 to 350
file_extensions = self.input_file_extensions or get_default_file_extensions(self.input_filetype)
if not file_extensions:
msg = f"Unsupported filetype: {self.input_filetype}"
raise ValueError(msg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added focused KMeansStage decomposition tests in tests/stages/deduplication/semantic/test_kmeans.py covering default extensions for parquet/jsonl, custom extension overrides, and unsupported filetype errors.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 3, 2026
@nightcityblade

Copy link
Copy Markdown
Contributor Author

Thanks for the follow-up. I pushed 19a7290 with the requested updates:

  • Removed the unreachable if not file_extensions: guard in KMeansStage.decompose().
  • Added focused tests for KMeansStage default file-extension selection, override handling, and unsupported filetype errors.
  • I left semantic/workflow.py unchanged: it already passes input_file_extensions through to KMeansStage, so the default resolution/validation now happens in one place during KMeansStage.decompose().

Validation:

  • uv run ruff check nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py
  • uv run ruff format --check nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py
  • python3 -m py_compile nemo_curator/stages/deduplication/semantic/kmeans.py tests/stages/deduplication/semantic/test_kmeans.py

Targeted pytest is still blocked on this macOS runner by NeMo-Curator’s Linux-only import guard.

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Jun 4, 2026

@sarahyurick sarahyurick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nightcityblade ! LGTM, although I will default to @praateekmahajan before merging, since he was the one who created the issue.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

See if all workflows expose file_extensions for reader i/o

3 participants