Add support for Slurm arrays#2059

Open
sarahyurick wants to merge 13 commits into
NVIDIA-NeMo:mainfrom
sarahyurick:slurm_array

@sarahyurick sarahyurick commented Jun 9, 2026

Contributor

TODO:

  • Add Slurm array parameters to FilePartitioningStage
  • Propagate Slurm array parameters through JsonlReader, ParquetReader, etc.
  • Add retry support
  • Add FailedTask support
  • Add a tutorial
  • Add nemo-curator-slurm-cli
  • Address case when SLURM_ARRAY_TASK_COUNT > cluster limit
  • Add tests
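
For readers unfamiliar with Slurm arrays, the first TODO item could look roughly like the sketch below. It is not the PR's implementation: `select_files_for_array_task` is a hypothetical helper, and it assumes the job was submitted with a zero-based array such as `--array=0-(N-1)`, so that `SLURM_ARRAY_TASK_ID` can be used directly as a shard index.

```python
import os


def select_files_for_array_task(files, task_id=None, task_count=None):
    """Return the files owned by one Slurm array task (hypothetical helper).

    Assumes a zero-based array (--array=0-(N-1)). Falls back to a single
    shard when the Slurm variables are not set, e.g. outside of Slurm.
    """
    if task_id is None:
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if task_count is None:
        task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError(f"invalid array task {task_id} of {task_count}")

    # Sort first so every array task sees the same ordering, then take
    # every task_count-th file starting at this task's index.
    return sorted(files)[task_id::task_count]
```

Striding over a sorted list keeps the shards disjoint and makes their union the full input, regardless of the order in which each array task happens to list the files.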

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

copy-pr-bot Bot commented Jun 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

sarahyurick and others added 4 commits June 9, 2026 14:54
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
embedding_vllm_init_kwargs: dict[str, Any] | None = None
hf_token: str | None = None
model_cache_dir: str | None = None
enable_array_partitioning: bool = False

Contributor Author

Not sure if it is useful here.

Contributor Author

cc @praateekmahajan I blindly added this here, but on reflection it probably does not make sense, since the subsequent workflows need to run globally (in a single Slurm job).

@sarahyurick sarahyurick marked this pull request as ready for review June 11, 2026 17:31
@sarahyurick sarahyurick requested review from a team, abhinavg4 and suiyoubi as code owners June 11, 2026 17:31
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
sarahyurick and others added 6 commits June 11, 2026 12:53
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
# Guarantee every emitted task has a task_id (derived id, or uuid fallback).
results = self._post_process_task_ids(tasks, results)

self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])

@sarahyurick sarahyurick Jun 12, 2026

Contributor Author

Discussed with @abhinavg4. For now, the PR tracks FailedTask instances by checking for a user-set environment variable, FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR", and writing one JSON file per failed task to the specified directory.

I went with the environment variable plus write-to-disk approach because it seems more reliable than trying to manage a global Python variable. It is an environment variable so that BaseStageAdapter does not have to propagate an additional parameter to every single stage (which I think would also require updating the executors?). Open to other suggestions.
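
As a rough illustration of the approach described above (not the code in this PR), recording failures could be sketched as follows. The environment variable name comes from the comment; the `FailedTask` fields, the `record_failed_tasks` function, and the one-file-per-`task_id` naming are assumptions made for the example.

```python
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path

FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR"


@dataclass
class FailedTask:
    """Hypothetical stand-in; the real FailedTask fields are not shown here."""

    task_id: str
    stage_name: str
    error: str


def record_failed_tasks(failed):
    """Write one JSON file per failed task; no-op if the env var is unset."""
    out_dir = os.environ.get(FAILED_TASKS_DIR_ENV_VAR)
    if not out_dir or not failed:
        return []

    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    written = []
    for task in failed:
        # Assumed naming scheme: one file per task_id.
        path = directory / f"{task.task_id}.json"
        path.write_text(json.dumps(asdict(task)))
        written.append(path)
    return written
```

Because the directory is read from the environment at call time, no stage or executor signature has to change, which matches the motivation given above. The trade-off is that the setting is implicit and must be exported in every Slurm array task's environment.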
