Skip to content

Handle save checkpoint 409 conflicts as idempotent success in TrainingClient#21

Open
nishanthp wants to merge 2 commits into
thinking-machines-lab:mainfrom
nishanthp:fix_409_conflict_sdk
Open

Handle save checkpoint 409 conflicts as idempotent success in TrainingClient#21
nishanthp wants to merge 2 commits into
thinking-machines-lab:mainfrom
nishanthp:fix_409_conflict_sdk

Conversation

@nishanthp
Copy link
Copy Markdown

Fixes checkpoint save failures caused by retrying save operations after transient transport errors where the first attempt already succeeded server-side.

Issue context: #401 identified that retrying save_state_async / save_weights_for_sampler_async can hit 409 Conflict because the checkpoint name already exists.

What changed

  • Updated tinker/src/tinker/lib/internal_client_holder.py
    • Removed 409 from generic retryable status codes.
    • Rationale: 409 should not be universally retried across all API operations.
  • Updated tinker/src/tinker/lib/public_interfaces/training_client.py
    • Added save-operation-specific conflict recovery for:
      • save_state(...)
      • _save_weights_for_sampler_impl(...)
    • On ConflictError, TrainingClient now:
      1. lists checkpoints for the current model,
      2. finds a matching checkpoint by name + type (training/sampler),
      3. reuses the existing tinker_path as idempotent success.
    • If no matching checkpoint is found, the original error is re-raised.
  • Added tinker/tests/test_checkpoint_conflict_recovery.py
    • validates checkpoint-name matching rules
    • validates 409 is no longer marked retryable globally

Why this approach

  • Keeps retry policy conservative at the generic layer.
  • Implements idempotency where semantics are known (save checkpoints).
  • Avoids cookbook-side workarounds and addresses root cause in SDK behavior.

Related PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant