
Add prime-rl training cookbook#5

Open
AnandK27 wants to merge 6 commits into main from cookbook/prime-rl-training

Conversation

Contributor

@AnandK27 AnandK27 commented Apr 1, 2026

Summary

  • Adds a new cookbook (cookbook/prime-rl-training/) that bridges SimLab trajectory collection with Prime Intellect's prime-rl for RL training of agent models
  • Full pipeline: collect tool-use trajectories from SimLab environments → convert to SFT datasets → build and push a verifiers environment → run hosted RL training via prime rl run
  • Includes trajectory converter, verifiers environment wrapper, CLI collector, SFT/RL configs, human guide, and agent SKILL.md
  • Ships with 3 example customer support tasks (billing disputes, API escalation, SLA issues)
  • Verifiers environment pushed and validated on Prime Intellect hub as collinear-simlab/simlab-tasks
  • End-to-end tested: RL training run completed 50 steps on Qwen/Qwen3.5-9B (run p8zaxiqbge02roi3g89o7wdg)

Test plan

  • uv sync installs all dependencies from simlab repo
  • simlab env init + simlab tasks-gen generate environment and tasks
  • python -m prime_rl_training.collect sft converts artifacts to SFT dataset
  • Verifiers environment loads correctly (load_environment() returns SingleTurnEnv)
  • Environment pushed to Prime Intellect hub (prime env push)
  • RL training run completed on Prime Intellect platform (prime rl run)
  • Checkpoints produced (5 checkpoints at steps 25-45)

🤖 Generated with Claude Code

AnandK27 and others added 3 commits April 1, 2026 00:25
Cookbook that bridges SimLab's task execution with Prime Intellect's
prime-rl for RL training of agent models. The full pipeline:

- Collect tool-use trajectories from SimLab environments
- Convert to SFT datasets (HuggingFace messages format)
- Build and push a verifiers environment to Prime Intellect hub
- Run hosted RL training via `prime rl run`
- Evaluate trained models back through SimLab

Includes example customer support tasks, quality+completeness rubrics,
and configs for both SFT warmup and RL training on Qwen3.5-9B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Import Environment, SingleTurnEnv, Parser, Rubric directly from
  verifiers submodules instead of via lazy vf.* attributes
- Change messages list type to dict[str, Any] to allow tool_calls values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
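The typing change in the commit above can be illustrated with a minimal sketch. The field names follow the OpenAI-style messages format; the exact trajectory schema used by the converter is an assumption here, not taken from the diff:

```python
from typing import Any

# Typing the list as list[dict[str, Any]] (rather than dict[str, str])
# lets assistant turns carry structured tool_calls alongside -- or
# instead of -- plain string content.
messages: list[dict[str, Any]] = [
    {"role": "user", "content": "Refund order #123"},
    {
        "role": "assistant",
        "content": None,  # tool-calling turns may have no text content
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "lookup_order",
                    "arguments": '{"order_id": "123"}',
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"status": "shipped"}'},
]

# dict[str, Any] values accept both strings and nested structures.
assert messages[1]["tool_calls"][0]["function"]["name"] == "lookup_order"
```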
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

aditj commented Apr 1, 2026

  • Is there a reason to include the 3 sample generated tasks?
  • How should we document the experiment you did for 30 steps? Any reward curves we can post on X?

AnandK27 and others added 3 commits April 1, 2026 13:47
- Replace placeholder task prompts with actual generated SimLab customer
  support tasks (enterprise escalation, billing dispute, SLA breach)
- Bump env version to 0.2.0
- Switch default RL model to Qwen/Qwen3.5-4B

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rubric functions received completion as a Messages list (not a
plain string). Thinking models put output in reasoning_content with
content=null, so the old string-based rubric always scored 0.

Added _extract_text() helper that handles both plain strings and
Messages lists, extracting content + reasoning_content from all
assistant messages. Tested: Step 0 reward went from 0.0 to 0.4813.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- P1: Stop run_pipeline.sh from overwriting committed env files
- P2: Add try/except ValueError for reward.txt parsing
- P2: Remove dead code in artifacts_to_messages tool_calls loop
- P2: Change default agent model from gpt-5.2 to gpt-4.1-mini

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
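The P2 reward-parsing fix above amounts to something like this sketch (the file name comes from the commit message; the 0.0 fallback and function name are assumptions for illustration):

```python
def parse_reward(text: str) -> float:
    """Parse a reward value from reward.txt contents.

    Falls back to 0.0 on malformed input instead of crashing the
    pipeline -- the try/except ValueError hardening described above.
    """
    try:
        return float(text.strip())
    except ValueError:
        return 0.0
```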