Strip user-set task_id from tutorials & getting-started script#2058

Merged
abhinavg4 merged 1 commit into
mainfrom
abhinavg/fix-math-tutorial-task-id
Jun 8, 2026

@abhinavg4

Contributor

What

PR #2036 made Task.task_id init=False (framework-owned — assigned by the executor adapter at each stage boundary). The tutorials/ examples and the getting-started verify script were not swept and still pass task_id= to Task constructors, so they now crash:

TypeError: FileGroupTask.__init__() got an unexpected keyword argument 'task_id'

This was reported for the math cc_index_lookup pipeline; the same break exists across audio / text / synthetic / slurm / quickstart tutorials and the getting-started CPU verify script.
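The failure mode can be reproduced with plain dataclasses. The sketch below is a minimal stand-in, not the framework's code: the `Task` and `FileGroupTask` classes and their fields (`dataset_name`, `data`) are hypothetical, and only the `init=False` mechanism is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical stand-in for the framework's Task base class.
    # init=False removes task_id from the generated __init__ signature.
    task_id: str = field(init=False, default="")
    dataset_name: str = ""


@dataclass
class FileGroupTask(Task):
    # Hypothetical subclass; the real class is assumed to have more fields.
    data: list = field(default_factory=list)


def try_construct() -> str:
    """Return the error message raised by the old call style."""
    try:
        FileGroupTask(task_id="task_0000", dataset_name="demo", data=["a.parquet"])
    except TypeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(try_construct())
```

Because the field still exists on the instance, reading `task.task_id` keeps working; only passing it to the constructor is rejected.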

Change

Remove the task_id= kwarg at every construction site (FileGroupTask, DocumentBatch, AudioTask, SampleTask) — the framework assigns the id. Where a loop index existed only to build the removed task_id, drop the index (for _ ... / for batch in ...). Read-only uses of task_id (logging, a hash seed, the audio checkpoint payload dict) are left unchanged.
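The shape of the edit described above can be sketched as follows. This is an illustration under assumptions, not a copy of any tutorial file: `DocumentBatch` here is a hypothetical dataclass, and `make_tasks`, `make_empty_tasks`, and `describe` are made-up helper names.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentBatch:
    # Hypothetical stand-in: task_id is owned by the framework, not the caller.
    task_id: str = field(init=False, default="")
    dataset_name: str = ""
    data: list = field(default_factory=list)


def make_tasks(batches: list[list[str]]) -> list[DocumentBatch]:
    # Before: for i, batch in enumerate(batches):
    #             DocumentBatch(task_id=f"task_{i:04d}", dataset_name="demo", data=batch)
    # After: the index existed only to build task_id, so it is dropped.
    return [DocumentBatch(dataset_name="demo", data=batch) for batch in batches]


def make_empty_tasks(n: int) -> list[DocumentBatch]:
    # Before: for i in range(n): ... task_id=f"task_{i:04d}" ...
    # After: the unused index becomes "_".
    tasks = []
    for _ in range(n):
        tasks.append(DocumentBatch(dataset_name="demo"))
    return tasks


def describe(task: DocumentBatch) -> str:
    # Read-only use of task_id (for example in a log line) is left unchanged.
    return f"processing task {task.task_id!r} with {len(task.data)} items"
```

In this sketch the id stays at its empty default because nothing plays the role of the executor adapter; in the real pipeline the framework is expected to fill it in before a stage reads it.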

13 files, 8 insertions / 29 deletions — no behavior change beyond ids now being framework-assigned.

Labeled docs-only (tutorials/examples) so it skips CI, per @praateekmahajan.

🤖 Generated with Claude Code

PR #2036 made Task.task_id init=False (framework-owned, assigned by the
executor adapter), but the tutorials/ examples and the getting-started
verify script still passed task_id= to Task constructors (FileGroupTask,
DocumentBatch, AudioTask, SampleTask), so they crash with:
  TypeError: __init__() got an unexpected keyword argument 'task_id'
(reported for tutorials/math/1_cc_index_lookup.py).

Remove the task_id= kwarg at every construction site; the framework assigns
the id. Where a loop index existed only to build the removed task_id, drop it
(for _ / for batch in ...). Read-only uses of task_id (logging, a hash seed,
the audio checkpoint payload dict) are left as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Abhinav Garg <abhgarg@nvidia.com>
@abhinavg4 abhinavg4 requested a review from a team as a code owner June 8, 2026 17:22
@abhinavg4 abhinavg4 requested review from praateekmahajan and removed request for a team June 8, 2026 17:22
@copy-pr-bot

copy-pr-bot Bot commented Jun 8, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

Mechanical fix removing user-supplied task_id= keyword arguments from Task constructor calls across 13 tutorial and getting-started files, following PR #2036 which made task_id a framework-owned field (init=False). Without this patch, every affected tutorial crashes with TypeError: __init__() got an unexpected keyword argument 'task_id'.

  • All 13 files have their task_id= constructor arguments removed; where the loop index existed solely to build that string, the loop variable is dropped (for i, batch → for batch, for i in range → for _ in range).
  • Read-only accesses to task.task_id (checkpoint payload serialization, the hash-seed fallback, and logging) are intentionally left unchanged.
  • Checkpoint resume safety in the audio tutorials is preserved: the stable hash is carried in task._metadata[CKPT_HASH_KEY] across save/load cycles, so the fallback path through task.task_id in _task_hash() is never exercised on reloaded tasks.
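The resume argument in the last bullet can be illustrated with a small sketch. Everything here is an assumption made for illustration: the value of `CKPT_HASH_KEY`, the `AudioTask` fields, and the body of `task_hash` are hypothetical and are not taken from the tutorials' `_task_hash()`.

```python
import hashlib
from dataclasses import dataclass, field

CKPT_HASH_KEY = "ckpt_hash"  # assumed value; the real constant may differ


@dataclass
class AudioTask:
    # Hypothetical stand-in for the tutorial's task type.
    task_id: str = field(init=False, default="")
    data: dict = field(default_factory=dict)
    _metadata: dict = field(default_factory=dict)


def task_hash(task: AudioTask) -> str:
    # 1. A hash stored by an earlier save/load cycle wins.
    stored = task._metadata.get(CKPT_HASH_KEY)
    if stored:
        return stored
    # 2. Otherwise derive it from stable content, falling back to task_id last.
    seed = (
        task.data.get("session_name")
        or task.data.get("audio_filepath")
        or task.task_id
    )
    digest = hashlib.sha256(str(seed).encode("utf-8")).hexdigest()[:16]
    # Cache the result so later calls and reloads see the same value.
    task._metadata[CKPT_HASH_KEY] = digest
    return digest
```

Under these assumptions, a task reloaded with its metadata intact returns the stored hash even if the framework has assigned it a different `task_id` in the meantime.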

Confidence Score: 5/5

Safe to merge — the changes are a purely mechanical removal of constructor arguments that the framework no longer accepts, with no logic alterations.

All 13 files receive identical treatment: task_id= kwargs are dropped from Task constructors, and loop variables that existed only to build those strings are eliminated. The audio-tutorial checkpoint resume path remains stable because the hash is stored in _metadata[CKPT_HASH_KEY] and propagated through save/load, so the task.task_id fallback inside _task_hash() is never reached for reloaded tasks. The comprehension-local j in the high-quality SDG pipeline's df[id] assignment is unaffected by removing the outer enumerate variable. No behavioral change beyond task IDs now being assigned by the framework.
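The point about the comprehension-local j follows from Python scoping: a comprehension's loop variable lives in its own scope and does not depend on the enclosing loop's index. The sketch below is hypothetical (the real df[id] assignment is not shown in this thread), with `assign_ids` and the `doc_` prefix invented for illustration.

```python
def assign_ids(batches: list[list[str]]) -> list[list[str]]:
    out = []
    # Before: for i, batch in enumerate(batches): ...
    # After: the outer index is gone; the comprehension below is untouched.
    for batch in batches:
        # j is scoped to the comprehension and never used the outer index.
        ids = [f"doc_{j}" for j in range(len(batch))]
        out.append(ids)
    return out
```

Calling `assign_ids([["a", "b"], ["c"]])` yields per-batch ids that restart at zero, which shows the inner counter is independent of the removed outer variable.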

No files require special attention — the audio checkpoint helpers (callhome_diar/run.py and single_speaker_filter/run.py) were worth verifying but are correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tutorials/audio/callhome_diar/run.py | Drops task_id= from three Task constructors (reader, mono stage, DER stage); checkpoint hashing uses session_name/audio_filepath first so resume is unaffected. |
| tutorials/audio/single_speaker_filter/run.py | Drops task_id= from three constructors including _load_task; _metadata[CKPT_HASH_KEY] preserves the stable hash across checkpoint cycles, so resume logic is unaffected. |
| tutorials/quickstart.py | Removes task_id=random.randint(...) from creation and task_id=task.task_id from a SentimentStage output task; both removed correctly. |
| tutorials/slurm/pipeline.py | Loop variable i dropped along with task_id=f'task_{i:04d}', replaced with _; no other uses of i existed. |
| tutorials/synthetic/nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py | Comprehension-local j in the df[id] assignment is unaffected by removing the outer enumerate variable. |
| tutorials/text/llama-nemotron-data-curation/filters/model_filters.py | Removes task_id=batch.task_id from four separate DocumentBatch return sites across four filter/transform stages; all removals are consistent. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Task Constructor Call] --> B{task_id= kwarg present?}
    B -->|Before PR #2058| C[TypeError: unexpected keyword argument]
    B -->|After PR #2058| D[Constructor succeeds]
    D --> E[Framework executor adapter assigns task_id]
    E --> F{Is task_id used later?}
    F -->|Read-only: logging / checkpoint payload / hash seed| G[task.task_id read — unchanged]
    F -->|Loop index only used for task_id string| H[Loop variable dropped: for i → for _ ]
    G --> I[Pipeline continues normally]
    H --> I

Reviews (1): Last reviewed commit: "Strip user-set task_id from tutorials & ..."

@abhinavg4 abhinavg4 enabled auto-merge (squash) June 8, 2026 18:04
@abhinavg4 abhinavg4 added the documentation Improvements or additions to documentation label Jun 8, 2026
@abhinavg4

Contributor Author

/ok to test e83798c

@abhinavg4 abhinavg4 merged commit 8457b78 into main Jun 8, 2026
26 checks passed
Vmjkom pushed a commit to Vmjkom/Curator that referenced this pull request Jun 11, 2026
…A-NeMo#2058)


Labels

docs-only documentation Improvements or additions to documentation
