Add support for Slurm arrays by sarahyurick · Pull Request #2059 · NVIDIA-NeMo/Curator

sarahyurick · 2026-06-09T21:31:11Z

TODO:

Add Slurm array parameters to FilePartitioningStage
Propagate Slurm array parameters through JsonlReader, ParquetReader, etc.
Add retry support
Add FailedTask support
Add a tutorial
Add nemo-curator-slurm-cli
Address case when SLURM_ARRAY_TASK_COUNT > cluster limit
Add tests
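
For readers unfamiliar with Slurm arrays, here is a minimal sketch of the kind of file partitioning the TODO items above refer to. It is not the code in this PR: `array_layout_from_env` and `partition_files` are hypothetical names, and the sketch assumes the array was submitted as a contiguous range with step 1 (for example `--array=0-9`). Only the `SLURM_ARRAY_TASK_ID`, `SLURM_ARRAY_TASK_MIN`, and `SLURM_ARRAY_TASK_COUNT` environment variables come from Slurm itself.

```python
import os


def array_layout_from_env(environ=os.environ):
    """Return (zero-based index, task count) for the current Slurm array task.

    Hypothetical helper. Assumes a contiguous array range with step 1;
    outside of an array job it falls back to a single partition.
    """
    count = int(environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_min = int(environ.get("SLURM_ARRAY_TASK_MIN", "0"))
    return task_id - task_min, count


def partition_files(files, index, count):
    """Return the files assigned to array task `index` out of `count` tasks."""
    if count < 1 or not 0 <= index < count:
        raise ValueError(f"invalid array layout: index={index}, count={count}")
    # Sort first so every array task sees the same ordering, then stride.
    return sorted(files)[index::count]
```

Each array task would call `partition_files(all_files, *array_layout_from_env())` and process only its own slice; the slices are disjoint and together cover every file.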

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

copy-pr-bot · 2026-06-09T21:31:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


sarahyurick · 2026-06-10T18:12:57Z

    embedding_vllm_init_kwargs: dict[str, Any] | None = None
    hf_token: str | None = None
    model_cache_dir: str | None = None
+    enable_array_partitioning: bool = False


Not sure if it is useful here.

cc @praateekmahajan I blindly added this here, but on reflection it probably does not make sense, since the next workflows need to be run globally (in a single Slurm job)...


sarahyurick · 2026-06-12T16:37:10Z

        # Guarantee every emitted task has a task_id (derived id, or uuid fallback).
        results = self._post_process_task_ids(tasks, results)

+        self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])


Discussed with @abhinavg4. For now, the PR keeps track of FailedTask instances by looking for a user-set environment variable, FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR", and writing one JSON file per failed task to the specified directory.

I went with the environment variable plus file write approach because it seems more reliable than trying to manage a global Python variable, etc. It is an environment variable so that BaseStageAdapter does not have to propagate an additional parameter to every single stage (which I think would also involve updating the executors?). Open to other suggestions.
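
To make the approach described above concrete, here is a rough sketch of what recording failed tasks through an environment variable could look like. It is an illustration only, not the PR's `_record_failed_tasks`: the `FailedTask` dataclass and its fields, the `record_failed_tasks` function, and the file naming scheme are all assumptions. Only the environment variable name is taken from the comment.

```python
import json
import os
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path

FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR"


@dataclass
class FailedTask:
    # Hypothetical stand-in; the real FailedTask may carry different fields.
    task_id: str
    error: str


def record_failed_tasks(failed):
    """Write one JSON file per failed task; return the paths written.

    Does nothing when the environment variable is unset, so stages and
    executors need no extra parameter to opt in.
    """
    out_dir = os.environ.get(FAILED_TASKS_DIR_ENV_VAR)
    if not out_dir or not failed:
        return []
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for task in failed:
        # A random suffix keeps repeated failures of one task from colliding.
        path = directory / f"{task.task_id}-{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(asdict(task)))
        written.append(path)
    return written
```

Because every worker process writes its own files, no shared in-memory state is needed, which is the reliability argument made in the comment. A later step could list the directory to decide which inputs to retry.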

basic slurm array file partitioning

bde2217

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick and others added 4 commits June 9, 2026 14:54

add slurm array params to composite stages using filepartitioningstage

a0595f6

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add tutorial and tests

43ee179

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Merge branch 'main' into slurm_array

cae17b3

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

ruff

acfeceb

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick commented Jun 11, 2026


sarahyurick marked this pull request as ready for review June 11, 2026 17:31

sarahyurick requested review from a team, abhinavg4 and suiyoubi as code owners June 11, 2026 17:31

copy-pr-bot Bot temporarily deployed to public June 11, 2026 17:31 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 17:32 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci June 11, 2026 18:19 Error

ruff

bb1e30a

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

copy-pr-bot Bot temporarily deployed to public June 11, 2026 18:21 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 18:21 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 18:22 Inactive

sarahyurick and others added 6 commits June 11, 2026 12:53

more greptile comments

2ccbd3f

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add nonetask and failedtask sentinels

1b659ea

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

add failedtask detection and repeat

3522809

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

ruff

717edac

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

Merge branch 'main' into slurm_array

8f2345b

greptile comments

ebba73e

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick commented Jun 12, 2026



sarahyurick wants to merge 13 commits into NVIDIA-NeMo:main from sarahyurick:slurm_array
