sampler_weights not available while weights is available? #25

@GnarlyMshtep

I did a training run yesterday that produced some checkpoints. Here is a partial view of my config, which Tinker runs without problems:

{
    // Fork of u5mmibjz step 200
    "recipe": "tinker_cookbook.recipes.apps_rl.train_step_ranged",
    "model_name": "openai/gpt-oss-120b",
    "load_checkpoint_path": "tinker://d07c70ac-ee41-5bd0-bc58-817060f69db4:train:0/weights/final",
    "total_steps": 200,
    "max_steps": 200,
    "phases_json":  //some details of my reward 
}

However, replacing weights with sampler_weights in the checkpoint path makes it fail, i.e.

"load_checkpoint_path": "tinker://d07c70ac-ee41-5bd0-bc58-817060f69db4:train:0/sampler_weights/final",

raises the following error:

the error trace
13:36:44 tinker_cookbook.tfh.launcher INFO Created run directory: logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy
13:36:44 tinker_cookbook.tfh.launcher INFO Run ID: t49yfhhy
13:36:44 tinker_cookbook.tfh.launcher INFO Wandb: https://wandb.ai/matan-shtepel-carnegie-mellon-university/apps-tinker/runs/t49yfhhy
13:36:44 tinker_cookbook.tfh.launcher INFO Launching recipe: tinker_cookbook.recipes.apps_rl.train_step_ranged
13:36:46 tinker_cookbook.tfh.launcher INFO Registered with VFH run tracker (run_id=t49yfhhy)
13:36:46 tinker_cookbook.tfh.launcher INFO Starting training...
13:36:46 tinker.lib.internal_client_holder WARNING Your Tinker SDK version is outdated. Please upgrade to the latest version.
13:36:46 tinker.lib.public_interfaces.service_client INFO ServiceClient initialized for session d598355c-f51a-5ebd-85e5-0ba384c4bd8a
13:36:47 tinker_cookbook.checkpoint_utils INFO Using renderer from checkpoint metadata for tinker://d07c70ac-ee41-5bd0-bc58-817060f69db4:train:0/sampler_weights/final: gpt_oss_no_sysprompt
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: Currently logged in as: matan-shtepel (matan-shtepel-carnegie-mellon-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy/wandb/run-20260401_133647-t49yfhhy
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy
wandb: ⭐️ View project at https://wandb.ai/matan-shtepel-carnegie-mellon-university/apps-tinker
wandb: 🚀 View run at https://wandb.ai/matan-shtepel-carnegie-mellon-university/apps-tinker/runs/t49yfhhy
13:36:49 tinker_cookbook.utils.ml_log INFO
Configuration:
  learning_rate: 2e-05
  dataset_builder: {'phases_json': '[{"start_step": 0, "end_step": null, "parquet_path":
"/shared/matan/data/apps_backdoor_w_hidden_tagged_prompt", ... ': 8, 'model_name_for_tokenizer': 'openai/gpt-oss-120b',
'renderer_name': 'gpt_oss_no_sysprompt', 'seed': 0, 'total_steps': 200}
  model_name: 'openai/gpt-oss-120b'
  max_tokens: 6144
  log_path: 'logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy'
  eval_every: 20
  save_every: 20
  evaluator_builders: []
  load_checkpoint_path: 'tinker://d07c70ac-ee41-5bd0-bc58-817060f69db4:train:0/sampler_weights/final'
  renderer_name: 'gpt_oss_no_sysprompt'
  wandb_project: 'apps-tinker'
  wandb_name: 'fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy'
  wandb_run_id: 't49yfhhy'
  kl_penalty_coef: 0.0
  kl_discount_factor: 0.0
  kl_reference_config: None
  log_kl_from_base: False
  loss_fn: 'importance_sampling'
  loss_fn_config: None
  num_substeps: 1
  lora_rank: 32
  temperature: 1.0
  compute_post_kl: False
  remove_constant_reward_groups: False
  enable_trace: False
  span_chart_every: 0
  async_config: None
  stream_minibatch_config: None
  base_url: None
  ttl_seconds: 604800
  num_groups_to_log: 4
  rollout_json_export: False
  max_steps: 200
root:535 [INFO] Command line invocation: /shared/matan/code/tinker-cookbook/tinker_cookbook/tfh/__main__.py new --base-config configs/base/apps.json5 --override-config configs/03/31/fork_u5mmibjz_tent_abs_exp_penalty.json5 --description fork-u5mmibjz-exp-penalty-nonhidden
tinker_cookbook.utils.ml_log:485 [INFO] Logging to: logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy
tinker_cookbook.checkpoint_utils:294 [INFO] No checkpoints found at logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy/checkpoints.jsonl
tinker_cookbook.checkpoint_utils:325 [INFO] No checkpoints found with key state_path in logs/TinkerRuns/04/01/fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy
tinker.lib.internal_client_holder:326 [WARNING] Your Tinker SDK version is outdated. Please upgrade to the latest version.
tinker.lib.public_interfaces.service_client:75 [INFO] ServiceClient initialized for session 8b2eae2a-b416-5ac5-a87d-2fb256df6f0c
tinker_cookbook.checkpoint_utils:139 [INFO] Renderer metadata matches for checkpoint tinker://d07c70ac-ee41-5bd0-bc58-817060f69db4:train:0/sampler_weights/final: gpt_oss_no_sysprompt
tinker.lib.public_interfaces.service_client:159 [INFO] TrainingClient initialized for model 8b2eae2a-b416-5ac5-a87d-2fb256df6f0c:train:0
tinker.lib.telemetry:204 [INFO] Exception logged for session ID: 8b2eae2a-b416-5ac5-a87d-2fb256df6f0c
tinker.lib.telemetry:204 [INFO] Exception logged for session ID: 8b2eae2a-b416-5ac5-a87d-2fb256df6f0c
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/tfh/__main__.py", line 5, in <module>
    main()
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/tfh/launcher.py", line 151, in main
    _handle_new(args)
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/tfh/launcher.py", line 333, in _handle_new
    _launch_recipe(recipe_module=recipe_module, config=resolved, metadata=metadata)
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/tfh/launcher.py", line 593, in _launch_recipe
    asyncio.run(cli_main_fn(cli_config))
  File "/shared/matan/dotfiles/aiai-cluster/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/shared/matan/dotfiles/aiai-cluster/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/dotfiles/aiai-cluster/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/recipes/apps_rl/train_step_ranged.py", line 128, in cli_main
    await main(config)
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/utils/trace.py", line 526, in async_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/tinker_cookbook/rl/train.py", line 1493, in main
    training_client = await service_client.create_training_client_from_state_async(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/telemetry.py", line 384, in _awrapper
    return await cast(Callable[..., Awaitable[R]], func)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/public_interfaces/service_client.py", line 301, in create_training_client_from_state_async
    await load_future.result_async()
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/public_interfaces/api_future.py", line 132, in result_async
    return await asyncio.wrap_future(self._future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/telemetry.py", line 384, in _awrapper
    return await cast(Callable[..., Awaitable[R]], func)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/public_interfaces/training_client.py", line 564, in _load_state_impl
    future = await self.holder.execute_with_retries(_send_request)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/internal_client_holder.py", line 452, in execute_with_retries
    raise e
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/internal_client_holder.py", line 413, in execute_with_retries
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/lib/public_interfaces/training_client.py", line 559, in _send_request
    return await client.weights.load(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/resources/weights.py", line 62, in load
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/_base_client.py", line 1230, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/matan/code/tinker-cookbook/.venv/lib/python3.12/site-packages/tinker/_base_client.py", line 1031, in request
    raise self._make_status_error_from_response(err.response) from None
tinker.BadRequestError: Error code: 400 - {'detail': 'Path is invalid'}
wandb:
wandb: 🚀 View run fork-u5mmibjz-exp-penalty-nonhidden_t49yfhhy at: https://wandb.ai/matan-shtepel-carnegie-mellon-university/apps-tinker/runs/t49yfhhy

tfh.launcher is my own convenience wrapper, and it seems unlikely to interfere. It would also be nice if this error message included the path that the server considers invalid.
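To illustrate the kind of message I have in mind, here is a minimal client-side sketch. It is not part of the Tinker SDK: `parse_tinker_path` and `check_training_load_path` are hypothetical helpers, the `tinker://<run>/<kind>/<name>` layout is inferred only from the two paths in this issue, and the rule that training state must come from a `weights` path is an assumption based on the behavior I observed above.

```python
from dataclasses import dataclass

# Assumption: layout inferred from the two paths in this issue,
# tinker://<run id>/<kind>/<checkpoint name>. Not taken from Tinker docs.
KNOWN_KINDS = ("weights", "sampler_weights")


@dataclass(frozen=True)
class TinkerPath:
    run_id: str
    kind: str
    name: str


def parse_tinker_path(path: str) -> TinkerPath:
    """Split a tinker:// path into run id, checkpoint kind, and name."""
    prefix = "tinker://"
    if not path.startswith(prefix):
        raise ValueError(f"Not a tinker:// path: {path!r}")
    parts = path[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected tinker://<run>/<kind>/<name>, got {path!r}")
    run_id, kind, name = parts
    if kind not in KNOWN_KINDS:
        raise ValueError(f"Unknown checkpoint kind {kind!r} in {path!r}")
    return TinkerPath(run_id, kind, name)


def check_training_load_path(path: str) -> TinkerPath:
    """Fail early, naming the offending path, before any request is sent.

    Assumption: only 'weights' paths can be loaded as training state.
    """
    parsed = parse_tinker_path(path)
    if parsed.kind != "weights":
        raise ValueError(
            f"Cannot load training state from {path!r}: "
            f"kind is {parsed.kind!r}, expected 'weights'"
        )
    return parsed
```

Calling `check_training_load_path` on the `sampler_weights/final` path above would raise a `ValueError` that quotes the full path, which is the detail missing from `{'detail': 'Path is invalid'}`.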

I don't think I deleted the checkpoint, and I'm not even sure it is possible to delete the sampler_weights without deleting the weights.

Thanks for the great service! Sorry if I am missing something!
