@RexBearIU
Collaborator

Description

This update reorganizes the multi-host TPU reinforcement learning tutorial for MaxText, Tunix, and vLLM, adding a table of contents and revising the sections for environment setup, checkpoint conversion, and Docker image creation. It separates the steps for stable versus local builds, updates the workload submission commands for GRPO and GSPO, and adds a section for troubleshooting.

Tests

Verified the updated documentation by walking through the entire workflow, including environment setup, Docker image builds, and workload submission. The commands executed successfully as described. Attached are two test logs confirming the results.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov bot commented Dec 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@RexBearIU RexBearIU force-pushed the jackyf/docs/rl_multi branch 2 times, most recently from a6e8759 to e3f3b71 Compare December 31, 2025 09:28
Collaborator

@A9isha A9isha left a comment

Approval pending resolution of all comments

Thank you Jacky!

@RexBearIU RexBearIU force-pushed the jackyf/docs/rl_multi branch from 2c621cf to a45f9fe Compare January 6, 2026 04:27
@RexBearIU RexBearIU force-pushed the jackyf/docs/rl_multi branch 4 times, most recently from 48eb288 to 09d29f6 Compare January 8, 2026 02:31
Fix: Update installation instructions and Docker image references in RL on Multi-Host TPUs tutorial

fix: Update RL tutorial for clarity and workload management

fix: Improve clarity and details in RL tutorial

fix: Remove zone specification for XPK v0.14.0+

fix: Update workload variable naming in RL tutorial
@RexBearIU RexBearIU force-pushed the jackyf/docs/rl_multi branch from 09d29f6 to 7a2b4b6 Compare January 8, 2026 02:34
@copybara-service copybara-service bot merged commit c32eb92 into main Jan 8, 2026
24 checks passed
@copybara-service copybara-service bot deleted the jackyf/docs/rl_multi branch January 8, 2026 07:33

3 participants