
Add prime-rl training cookbook#5

Open
AnandK27 wants to merge 6 commits into main from cookbook/prime-rl-training

Conversation

Contributor

@AnandK27 AnandK27 commented Apr 1, 2026

Summary

  • Adds a new cookbook (cookbook/prime-rl-training/) that bridges SimLab trajectory collection with Prime Intellect's prime-rl for RL training of agent models
  • Full pipeline: collect tool-use trajectories from SimLab environments → convert to SFT datasets → build and push a verifiers environment → run hosted RL training via prime rl run
  • Includes trajectory converter, verifiers environment wrapper, CLI collector, SFT/RL configs, human guide, and agent SKILL.md
  • Ships with 3 example customer support tasks (billing disputes, API escalation, SLA issues)
  • Verifiers environment pushed and validated on Prime Intellect hub as collinear-simlab/simlab-tasks
  • End-to-end tested: RL training run completed 50 steps on Qwen/Qwen3.5-9B (run p8zaxiqbge02roi3g89o7wdg)

Test plan

  • uv sync installs all dependencies from simlab repo
  • simlab env init + simlab tasks-gen generate environment and tasks
  • python -m prime_rl_training.collect sft converts artifacts to SFT dataset
  • Verifiers environment loads correctly (load_environment() returns SingleTurnEnv)
  • Environment pushed to Prime Intellect hub (prime env push)
  • RL training run completed on Prime Intellect platform (prime rl run)
  • Checkpoints produced (5 checkpoints at steps 25-45)

🤖 Generated with Claude Code

AnandK27 and others added 3 commits April 1, 2026 00:25
Cookbook that bridges SimLab's task execution with Prime Intellect's
prime-rl for RL training of agent models. The full pipeline:

- Collect tool-use trajectories from SimLab environments
- Convert to SFT datasets (HuggingFace messages format)
- Build and push a verifiers environment to Prime Intellect hub
- Run hosted RL training via `prime rl run`
- Evaluate trained models back through SimLab

Includes example customer support tasks, quality+completeness rubrics,
and configs for both SFT warmup and RL training on Qwen3.5-9B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Import Environment, SingleTurnEnv, Parser, Rubric directly from
  verifiers submodules instead of via lazy vf.* attributes
- Change messages list type to dict[str, Any] to allow tool_calls values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
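The typing change in the commit above can be illustrated with a minimal sketch. The field names follow the OpenAI-style messages format; the exact trajectory schema used by the converter is an assumption here, not taken from the diff:

```python
from typing import Any

# Typing the list as list[dict[str, Any]] (rather than dict[str, str])
# lets assistant turns carry structured tool_calls alongside -- or
# instead of -- plain string content.
messages: list[dict[str, Any]] = [
    {"role": "user", "content": "Refund order #123"},
    {
        "role": "assistant",
        "content": None,  # tool-calling turns may have no text content
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "lookup_order",
                    "arguments": '{"order_id": "123"}',
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"status": "shipped"}'},
]

# dict[str, Any] values accept both strings and nested structures.
assert messages[1]["tool_calls"][0]["function"]["name"] == "lookup_order"
```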
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

aditj commented Apr 1, 2026

  • Is there a reason to include the 3 sample generated tasks?
  • How should we document the experiment you did for 30 steps? Any reward curves we can post on X?

AnandK27 and others added 3 commits April 1, 2026 13:47
- Replace placeholder task prompts with actual generated SimLab customer
  support tasks (enterprise escalation, billing dispute, SLA breach)
- Bump env version to 0.2.0
- Switch default RL model to Qwen/Qwen3.5-4B

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rubric functions received completion as a Messages list (not a
plain string). Thinking models put output in reasoning_content with
content=null, so the old string-based rubric always scored 0.

Added _extract_text() helper that handles both plain strings and
Messages lists, extracting content + reasoning_content from all
assistant messages. Tested: Step 0 reward went from 0.0 to 0.4813.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- P1: Stop run_pipeline.sh from overwriting committed env files
- P2: Add try/except ValueError for reward.txt parsing
- P2: Remove dead code in artifacts_to_messages tool_calls loop
- P2: Change default agent model from gpt-5.2 to gpt-4.1-mini

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
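The P2 reward-parsing fix above amounts to something like this sketch (the file name comes from the commit message; the 0.0 fallback and function name are assumptions for illustration):

```python
def parse_reward(text: str) -> float:
    """Parse a reward value from reward.txt contents.

    Falls back to 0.0 on malformed input instead of crashing the
    pipeline -- the try/except ValueError hardening described above.
    """
    try:
        return float(text.strip())
    except ValueError:
        return 0.0
```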