Add support for Slurm arrays #2059
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
    embedding_vllm_init_kwargs: dict[str, Any] | None = None
    hf_token: str | None = None
    model_cache_dir: str | None = None
    enable_array_partitioning: bool = False
Not sure if it is useful here.
cc @praateekmahajan I added this here without much thought, but thinking more about it, this probably does not make sense, since the next workflows need to be run globally (in a single Slurm job)...
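For context, here is a minimal sketch of what per-array-task partitioning usually looks like, which shows why a stage that needs a global view cannot use it. This is not the PR's implementation: `array_slice` and `slice_for_current_array_task` are hypothetical names, and the sketch assumes the array was submitted as `--array=0-(N-1)` so that `SLURM_ARRAY_TASK_ID` runs from 0 to `SLURM_ARRAY_TASK_COUNT - 1`.

```python
import os


def array_slice(items: list[str], task_id: int, task_count: int) -> list[str]:
    """Return the items assigned to one array task using a strided split."""
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError(f"invalid task_id={task_id} for task_count={task_count}")
    # Task k takes items k, k + task_count, k + 2 * task_count, ...
    return items[task_id::task_count]


def slice_for_current_array_task(items: list[str]) -> list[str]:
    """Pick this process's share of the inputs from the Slurm array environment."""
    # Slurm sets both variables inside array jobs; outside an array job we
    # fall back to a single task that sees every item.
    # Assumption: the array indices start at 0 with a step of 1.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return array_slice(items, task_id, task_count)
```

Each array task only ever sees its own slice, so anything that must operate over the full dataset has to run afterwards in a single job.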
    # Guarantee every emitted task has a task_id (derived id, or uuid fallback).
    results = self._post_process_task_ids(tasks, results)

    self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])
Discussed with @abhinavg4. For now, the PR keeps track of FailedTask instances by looking for a user-set environment variable, FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR", and writing one JSON file per failed task to the specified directory.
I went with the environment variable plus file-write approach because it seems more reliable than trying to manage a global Python variable or similar. It is an environment variable so that BaseStageAdapter does not have to propagate an additional parameter to every single stage (which I think would also involve updating the executors). Open to other suggestions.
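A rough sketch of the approach described above, for discussion only. The `FailedTask` dataclass here is a hypothetical stand-in (the real class may carry different fields), and `record_failed_tasks` is an illustrative free function rather than the actual `_record_failed_tasks` method; only the environment variable name comes from the PR.

```python
import json
import os
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path

FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR"


@dataclass
class FailedTask:
    # Hypothetical fields, chosen only to make the sketch runnable.
    task_id: str
    error: str


def record_failed_tasks(failed: list[FailedTask]) -> list[Path]:
    """Write one JSON file per failed task; do nothing if the env var is unset."""
    out_dir = os.environ.get(FAILED_TASKS_DIR_ENV_VAR)
    if not out_dir or not failed:
        return []
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for task in failed:
        # The uuid suffix keeps concurrent writers (for example, several
        # array tasks sharing one directory) from overwriting each other.
        path = directory / f"{task.task_id}-{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(asdict(task)))
        written.append(path)
    return written
```

Because the directory is read from the environment at call time, no stage or executor signature has to change; the trade-off is that the behavior is implicit and easy to miss if the variable is not set.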
TODO:
- FilePartitioningStage
- JsonlReader, ParquetReader, etc.
- FailedTask support
- nemo-curator-slurm-cli
- SLURM_ARRAY_TASK_COUNT > cluster limit