OPSD: rebase self_distill_hybrid on verl 0.7.0.7 API by HJSang · Pull Request #11 · HJSang/CRISP_Reasoning_Compression

HJSang · 2026-05-30T05:23:39Z

Summary

Brings the OPSD trainer/worker, verifier, dataset config, and SFT launch
scripts up to date with the verl 0.7.0.7 hybrid-engine API that ships in
the cluster image (mldev-verl-sglang-py312-cu13-image:1.0.206, verl
0.7.0.7+deec5d0.cu130). Drops the dead verl/ overlay path in the
setup script, drops the SFT-style _compute_val_loss machinery, and
fixes the rollout↔trainer weight-sync gap that was silently keeping
sglang generations on the initial checkpoint.

What changed

Trainer / worker (workspace/src/self_distill_hybrid/)

Wire CheckpointEngineManager in OPSDTrainer.init_workers. Push
trainer weights to sglang replicas at fit() start, sleep replicas
after every gen, re-sync after each update_opsd. Without this the
rollout was pinned to init weights for the whole run.
Rebase OPSDWorker directly on AsyncActorRolloutRefWorker; delete
the now-orphan SelfDistillWorker base class.
update_teacher: switch to Dispatch.ONE_TO_ALL (was sending a
dummy DataProto over an nd-compute dispatcher).
_opsd_update: pad train batches to actual DP world
(total_gpus / (TP * Ulysses-SP)), not total GPUs.
Drop _compute_val_loss / _build_val_data and the sd_val_files
plumbing. Generation-based _validate over data.val_files + reward
fn is the sole val signal.
_validate: drop the spurious sleep_replicas() that left rollout
asleep when the function returned.
AgentLoopManager.create: pass actor_rollout_resource_pool to
match verl PPO's call site.
main_opsd: construct val_reward_fn by directly instantiating
the legacy verl.workers.reward_manager.naive.NaiveRewardManager.
verl 0.7.x's load_reward_manager returns the experimental
async-only manager which isn't __call__-able — that crashed
_validate at the first val step.

Verifier (sd_verifier.py)

_tokenize_sequence refuses any pair whose teacher OR student
sequence would exceed max_length instead of silently truncating.
One-sided truncation misaligned teacher/student response logits
per-sample for the whole batch tail. Defensive assert added in
_opsd_training_step.
Skip-counter warning now surfaces the over-length skip ratio +
max_length so misconfigurations are loud.
Add _log_first_opsd_pair for a readable INFO-level preview of
the first (teacher, student, response) triple per batch.

Yaml (config/opsd_trainer.yaml)

Drop dead top-level custom_reward_function block; reward fn lives
under the inherited reward.custom_reward_function.
Drop sd_val_files / opsd.val_max_samples (val-loss machinery gone).

Scripts (workspace/scripts/sft/)

setup_sft.sh: stop uninstalling and re-installing verl from a local
vendored tree; the image ships matching verl. Just pip install
the two extras (math-verify, tensordict).
train_opsd.sh: drop SD_VAL_PROMPTS_PATH, the VERL_ROOT
PYTHONPATH prefix, and the broken top-level
custom_reward_function.path override (it never reached
load_reward_manager). Wire the dual-path math_verify scorer via
reward.custom_reward_function.path + .name. Default
REWARD_FN_PATH to workspace/src/rewards/dual_path_math_verify.py.
Add TOTAL_TRAINING_STEPS hard cap. Tighten with -eo pipefail.

Test plan

All seven source files + yaml parse cleanly (python3 -m ast).
Cross-checked every from verl.* import against verl 0.7.0.7
module layout — all symbols resolve.
Verified DataProto tensor-key contract producer↔consumer:
build_opsd_batch writes 8 keys; _opsd_training_step reads
the same 8.
Confirmed _validate no longer leaves replicas asleep on return
(matches verl PPO _validate pattern).
LVA1 submission f5b55303acc5540a5b75 (tu0) — runs to first
val step without the 'NaiveRewardManager' object is not callable crash.
LVA1 submission fcbed452f8d7640709d2 (tu100) — same plus
update_teacher fires twice at steps 100 and 200.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Generation-based validation with CLI overrides for custom reward function, training-steps cap, and pre-validation control.
Bug Fixes
- Reject over-length samples to avoid token misalignment; clearer skipped/total drop messages with guidance.
Refactor
- Async hybrid rollouts with checkpoint-driven weight sync; trainer/worker interfaces and teacher-update flow revised.
Chores
- Scripts enforce stricter error handling; installer uses image-provided stack and only adds missing extras; DAPO preprocessing drops validation split.
Style/Diagnostics
- Improved logging and broader package verification output (prints version and expanded package list).

Bring the OPSD trainer/worker, verifier, dataset config, and the SFT launch scripts up to date with the verl 0.7.0.7 hybrid-engine surface that ships in the cluster image. Trainer / worker (workspace/src/self_distill_hybrid) - Wire CheckpointEngineManager: construct it in OPSDTrainer.init_workers, push trainer weights to the sglang replicas at fit() start, sleep replicas after every gen, and re-sync after each update_opsd so the on-policy invariant actually holds. Without this the rollout was pinned to the initial checkpoint for the whole run. - Rebase OPSDWorker directly on AsyncActorRolloutRefWorker; delete the now-unreferenced SelfDistillWorker base class and its update_sft path. - update_teacher: switch from make_nd_compute_dataproto_dispatch_fn to Dispatch.ONE_TO_ALL so it fires once per FSDP rank with no dummy data. - _opsd_update: pad train batches to actual DP world (total_gpus / (TP * Ulysses-SP)), not total GPUs, to stop the TP*SP-fold over-padding under the current TP=2 SP=4 config. - Drop _compute_val_loss / _build_val_data and the sd_val_files plumbing. Generation-based _validate over data.val_files + a reward fn is now the sole val signal. - _validate: drop the spurious sleep_replicas() that left rollout asleep when the function returned, breaking the next iter's gen. - AgentLoopManager.create: pass actor_rollout_resource_pool to match the verl PPO trainer's call site. - main_opsd: build val_reward_fn by directly instantiating the legacy verl.workers.reward_manager.naive.NaiveRewardManager (the one verl 0.7.x's load_reward_manager returns is the experimental async-only manager and is not __call__-able, which crashed _validate at the first val step). Verifier (sd_verifier.py) - _tokenize_sequence now refuses any pair whose teacher OR student sequence would exceed max_length instead of silently truncating; the truncation case misaligned teacher/student response logits per-sample for the whole batch tail. Defensive assert in _opsd_training_step catches any future violation. - Skip-counter warning surfaces the over-length skip ratio + max_length so misconfigurations are loud. - Add _log_first_opsd_pair for a single readable INFO-level preview of the first (teacher, student, response) triple per batch. Yaml (config/opsd_trainer.yaml) - Drop the dead top-level custom_reward_function block; reward fn lives under the inherited reward.custom_reward_function and is overridden there from train_opsd.sh. - Drop sd_val_files / opsd.val_max_samples (val-loss machinery gone). Scripts (workspace/scripts/sft) - setup_sft.sh: stop uninstalling and re-installing verl from a local vendored tree; the image ships the matching verl 0.7.0.7+deec5d0.cu130 build. Just pip-install the two extras (math-verify, tensordict). - train_opsd.sh: drop SD_VAL_PROMPTS_PATH, the VERL_ROOT PYTHONPATH prefix, and the broken top-level custom_reward_function.path override (it never reached load_reward_manager). Wire the dual-path math_verify scorer via reward.custom_reward_function.path + .name. Default REWARD_FN_PATH to workspace/src/rewards/dual_path_math_verify.py. Add TOTAL_TRAINING_STEPS hard cap. Tighten with -eo pipefail. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-05-30T05:23:50Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Migrate OPSD to VERL 0.7.x hybrid rollouts: use the image's verl stack and install extras, tighten train script and YAML reward config, change OPSDWorker to AsyncActorRolloutRefWorker, refactor OPSDTrainer to checkpoint-driven replica sync, and switch validation to generation/reward-based flow with stricter tokenization.

Changes

OPSD VERL 0.7.x Migration

Layer / File(s)	Summary
Environment Setup and Configuration `workspace/scripts/sft/setup_sft.sh`, `workspace/scripts/sft/train_opsd.sh`, `workspace/src/self_distill_hybrid/config/opsd_trainer.yaml`	Use container-provided `verl` stack and install only `math-verify`/`tensordict`. `train_opsd.sh` enables `set -xeo pipefail`, adjusts `PYTHONPATH` to `${SD_SRC}`, adds `REWARD_FN_PATH` and `TOTAL_TRAINING_STEPS`, removes val-file CLI injection, and wires reward CLI keys to `reward.custom_reward_function.*`. YAML migrates reward defaults to `reward@reward: reward` and enables `legacy_reward_impl`.
Trainer Hybrid Infrastructure Integration `workspace/src/self_distill_hybrid/opsd_trainer.py`	Integrates `CheckpointEngineManager` and `AgentLoopManager.create(...)`, computes `self.dp_world` for DP-axis padding, removes `val_data_path` usage, wakes and pushes weights to replicas before generation, sleeps replicas after generation to free memory, and forces resync after teacher updates; `_update_teacher_weights()` simplified.
Validation Flow Refactoring `workspace/src/self_distill_hybrid/main_opsd.py`	Removes SD-val parquet auto-detection and `load_reward_manager` use; when `config.data.val_files` is set, builds a `NaiveRewardManager` using `get_custom_reward_fn` (routed by `reward_fn_key`, `num_examine=0`); trainer receives `val_reward_fn` and `val_dataset` (no `val_data_path`).
Worker Architecture Refactoring `workspace/src/self_distill_hybrid/opsd_worker.py`	`OPSDWorker` now extends `AsyncActorRolloutRefWorker`; registration imports updated; `update_teacher` becomes a `Dispatch.ONE_TO_ALL` control method with no `DataProto` argument; shard-level param copy replaces FSDP full-parameter materialization and adds runtime shape checks.
Batch Tokenization and Logits/Entropy Alignment `workspace/src/self_distill_hybrid/sd_verifier.py`, `workspace/src/self_distill_hybrid/opsd_worker.py`	`sd_verifier.py` rejects samples exceeding `max_length`, adds `_log_first_opsd_pair` logging, and expands skipped-sample warnings. `opsd_worker.py` asserts teacher/student response-token counts match (skips zero-response cases) and weights entropy accumulation by actual response-token count (`n_resp`) instead of truncation-based `min_len`.
JSD / Entropy Comments and Chunking `workspace/src/self_distill_hybrid/opsd_worker.py`	Clarify chunking memory comments, differentiability of chunk-wise `.float()`, log-space mixture formulation for JSD, and log-space entropy computation comments.
DAPO Preprocessing Simplification `workspace/src/data/process_eval_data.py`	`process_dapo` now returns a single training `datasets.Dataset` (no train/val split), removes writing `val_dapo.parquet`, and updates printed summaries to omit DAPO validation counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit tunes the rollout stream with care,
Swapping truncation for truths laid bare,
Checkpoints hum as replicas wake and sleep,
Teachers copy weights while students learn deep,
Hop, VERL, hop—let hybrid training leap! 🐇✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and specifically describes the main objective of the changeset: rebasing the self_distill_hybrid module on the verl 0.7.0.7 API.
Docstring Coverage	✅ Passed	Docstring coverage is 85.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch hejian/Test_M

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Run on LVA1 with TP=2 SP=4 N_GPUS=8 crashed in `_opsd_update` with: AssertionError: only support equal chunk. Got size of DataProto 15 and chunk 2. after the verifier correctly dropped 1 over-length sample (16 → 15). The padding target was wrong: I had `dp_world = total_gpus / (TP * SP)` which gives 1 for the failing config, so the `n_samples % dp_world != 0` gate never fired and the un-padded 15-row batch reached the dispatcher. But the dispatcher's `make_nd_compute_dataproto_dispatch_fn(mesh_name="actor")` slices along the actor mesh's DP axis, and the actor mesh has dims `(dp, ulysses_sp)` only — TP is rollout-only (sglang's `tensor_model_parallel_size`) and does NOT collapse the actor's DP axis. Correct formula: `dp_world = total_gpus / ulysses_sp`. For the failing config that's 8/4 = 2, which matches the `chunks=2` the dispatcher demanded; 15 then pads to 16 and the dispatcher splits cleanly into two 8-row chunks. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

workspace/src/self_distill_hybrid/opsd_worker.py (1)

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast if actor/ref parameter layouts diverge.

zip() truncates silently. If the new worker wiring ever produces different parameter sets or ordering between actor and ref, this will copy only a prefix and still report success, leaving the teacher partially stale.

🐛 Suggested change

-                for p_student, p_teacher in zip(
-                    self.actor_module_fsdp.parameters(),
-                    self.ref_module_fsdp.parameters(),
-                ):
-                    p_teacher.data.copy_(p_student.data)
+                actor_params = dict(self.actor_module_fsdp.named_parameters())
+                ref_params = dict(self.ref_module_fsdp.named_parameters())
+                if actor_params.keys() != ref_params.keys():
+                    raise RuntimeError(
+                        "Actor/ref parameter sets diverged; refusing partial teacher sync"
+                    )
+                for name, p_student in actor_params.items():
+                    ref_params[name].data.copy_(p_student.data)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 79 - 85, The
current loop uses zip over self.actor_module_fsdp.parameters() and
self.ref_module_fsdp.parameters(), which silently truncates if the parameter
sequences differ; instead, inside the FSDP.summon_full_params blocks for
actor_module_fsdp and ref_module_fsdp, materialize both parameter lists (e.g.,
list(self.actor_module_fsdp.parameters()) and
list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally
validate matching shapes/names) and raise a RuntimeError with clear details if
they diverge, and only then proceed to copy each p_student.data to
p_teacher.data using the matched lists (references: actor_module_fsdp,
ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).

workspace/src/self_distill_hybrid/opsd_trainer.py (1)

580-595: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep val_before_train on its own step.

self.global_steps is incremented before the optional pre-train validation runs, so val_before_train gets logged at the same step as the first optimization step. That makes the charts and step-aligned comparisons ambiguous. Log the warm-up validation at step 0, or move the increment to after each train batch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 580 - 595,
The pre-training validation is logged at the wrong step because
self.global_steps is incremented before the optional val_before_train check;
move the increment so that the validation (triggered by val_before_train and
calling self._validate()) is logged at step 0: call
self.checkpoint_manager.update_weights(), run the optional validation and
logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is
still 0, and only increment self.global_steps (self.global_steps += 1) after
validation (or after the first training batch) so the warm-up validation is
recorded on its own step.

🧹 Nitpick comments (2)

workspace/scripts/sft/setup_sft.sh (1)
17-17: ⚡ Quick win

Constrain the extra installs to the validated stack.

This unconstrained install can still upgrade shared transitive packages in the image, so identical runs may pick up different math-verify/tensordict behavior over time. Please install these through the repo’s pinned constraints or explicit versions instead of floating to latest.
♻️ Suggested change
-pip install math-verify tensordict
+python3 -m pip install \
+  -c "${REPO_ROOT}/workspace/requirements.txt" \
+  math-verify \
+  tensordict
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/setup_sft.sh` at line 17, Replace the floating pip
install in setup_sft.sh (the line "pip install math-verify tensordict") with an
installation that uses the repository's pinned constraints or explicit versions;
e.g. use "pip install -c <constraints-file> math-verify tensordict" or pin
specific versions like "math-verify==<version> tensordict==<version>" so
transitive dependency upgrades are prevented and installs are reproducible.
workspace/scripts/sft/train_opsd.sh (1)
47-47: ⚡ Quick win

Avoid adding an empty PYTHONPATH entry.

If PYTHONPATH is empty here, this expands to ${SD_SRC}:, which adds the current working directory to Python’s import path. That makes module resolution depend on where the script is launched from.
♻️ Suggested change
-export PYTHONPATH="${SD_SRC}:${PYTHONPATH}"
+export PYTHONPATH="${SD_SRC}${PYTHONPATH:+:${PYTHONPATH}}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/train_opsd.sh` at line 47, The export currently
unconditionally appends PYTHONPATH which can produce an empty trailing entry;
update the export of PYTHONPATH (the line referencing SD_SRC and PYTHONPATH) so
that you prepend SD_SRC but only append the existing PYTHONPATH when it is
non-empty — do this by using a shell conditional or parameter expansion that
adds the colon and PYTHONPATH only if PYTHONPATH is set, thereby avoiding adding
an empty entry that would implicitly include the current working directory.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 97-109: The dp_world computation in _opsd_update currently uses
trainer-wide GPU counts (total_gpus) which can be larger than the role-specific
actor pool; after actor_rollout_wg is created, recompute dp_world from that
group's actual world size (e.g., use actor_rollout_wg.size or
actor_rollout_wg.world_size) and divide by (tensor_model_parallel_size *
ulysses_sequence_parallel_size) instead of using
self.config.trainer.n_gpus_per_node * nnodes; update the code in _opsd_update to
set self.dp_world based on the actor_rollout_wg world size and keep the same
denom logic (tp and sp) so padding/divisor uses the real actor worker group
size.

In `@workspace/src/self_distill_hybrid/sd_verifier.py`:
- Around line 335-340: The current logger.info call in sd_verifier.py is
emitting raw prompt and response content (logger.info(..., max_length, t_msgs,
s_msgs, len(response_text), response_preview)), which is a privacy/compliance
risk and too noisy; change this to logger.debug and sanitize/truncate the
payloads before logging (e.g., replace or mask sensitive tokens and limit
t_msgs, s_msgs, and response_preview to a fixed small length like 200 chars or a
summarised count), or alternatively log only metadata (max_length and response
length) at INFO and the sanitized/truncated content at DEBUG; update the call
site that references logger.info, max_length, t_msgs, s_msgs, response_text, and
response_preview accordingly.
- Around line 384-386: In build_opsd_batch(), the loop uses zip(teacher_prompts,
student_prompts, responses) which silently truncates mismatched lengths; add an
explicit length check comparing len(teacher_prompts), len(student_prompts), and
len(responses) (or use zip(..., strict=True) since we target Python 3.10+) and
raise a clear ValueError including the three lengths if they differ, so the
iteration cannot proceed with misaligned inputs; update the for-loop (identified
by the tuple unpacking for idx, (t_prompt, s_prompt, response_text) in
enumerate(...)) to rely on the guarded lengths or strict zip.

---

Outside diff comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 580-595: The pre-training validation is logged at the wrong step
because self.global_steps is incremented before the optional val_before_train
check; move the increment so that the validation (triggered by val_before_train
and calling self._validate()) is logged at step 0: call
self.checkpoint_manager.update_weights(), run the optional validation and
logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is
still 0, and only increment self.global_steps (self.global_steps += 1) after
validation (or after the first training batch) so the warm-up validation is
recorded on its own step.

In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 79-85: The current loop uses zip over
self.actor_module_fsdp.parameters() and self.ref_module_fsdp.parameters(), which
silently truncates if the parameter sequences differ; instead, inside the
FSDP.summon_full_params blocks for actor_module_fsdp and ref_module_fsdp,
materialize both parameter lists (e.g.,
list(self.actor_module_fsdp.parameters()) and
list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally
validate matching shapes/names) and raise a RuntimeError with clear details if
they diverge, and only then proceed to copy each p_student.data to
p_teacher.data using the matched lists (references: actor_module_fsdp,
ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).

---

Nitpick comments:
In `@workspace/scripts/sft/setup_sft.sh`:
- Line 17: Replace the floating pip install in setup_sft.sh (the line "pip
install math-verify tensordict") with an installation that uses the repository's
pinned constraints or explicit versions; e.g. use "pip install -c
<constraints-file> math-verify tensordict" or pin specific versions like
"math-verify==<version> tensordict==<version>" so transitive dependency upgrades
are prevented and installs are reproducible.

In `@workspace/scripts/sft/train_opsd.sh`:
- Line 47: The export currently unconditionally appends PYTHONPATH which can
produce an empty trailing entry; update the export of PYTHONPATH (the line
referencing SD_SRC and PYTHONPATH) so that you prepend SD_SRC but only append
the existing PYTHONPATH when it is non-empty — do this by using a shell
conditional or parameter expansion that adds the colon and PYTHONPATH only if
PYTHONPATH is set, thereby avoiding adding an empty entry that would implicitly
include the current working directory.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 80dec8bf-93c0-4d5f-a023-102d957db186

📥 Commits

Reviewing files that changed from the base of the PR and between 310cf27 and 80ad2fa.

📒 Files selected for processing (8)

workspace/scripts/sft/setup_sft.sh
workspace/scripts/sft/train_opsd.sh
workspace/src/self_distill_hybrid/config/opsd_trainer.yaml
workspace/src/self_distill_hybrid/main_opsd.py
workspace/src/self_distill_hybrid/opsd_trainer.py
workspace/src/self_distill_hybrid/opsd_worker.py
workspace/src/self_distill_hybrid/sd_verifier.py
workspace/src/self_distill_hybrid/sd_worker.py

💤 Files with no reviewable changes (1)

workspace/src/self_distill_hybrid/sd_worker.py

coderabbitai · 2026-05-30T05:34:48Z

+        # DP world size: total GPUs / (TP * Ulysses-SP). Used as the divisor
+        # when padding batches before the nd-compute dispatcher slices along
+        # the actor mesh's DP axis. Using total GPUs (as we did previously)
+        # over-pads whenever TP*SP > 1.
+        total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes
+        tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1)
+        sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1)
+        denom = max(1, int(tp) * int(sp))
+        self.dp_world = max(1, total_gpus // denom)
+        py_logger.info(
+            "DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)",
+            self.dp_world, total_gpus, tp, sp,
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive dp_world from the actor worker group.

Line 101 derives the padding divisor from trainer-wide GPU config, but this trainer now explicitly allocates a role-specific actor_rollout_resource_pool. If that pool is smaller than the full trainer allocation, _opsd_update() will pad to the wrong divisor and the DP sharding path can mispartition or assert. Compute dp_world after self.actor_rollout_wg is created, using its actual world size instead.

Suggested fix

- total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1) sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1) - denom = max(1, int(tp) * int(sp)) - self.dp_world = max(1, total_gpus // denom) - py_logger.info( - "DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)", - self.dp_world, total_gpus, tp, sp, - ) + self._dp_denom = max(1, int(tp) * int(sp)) + self.dp_world = None @@ self.actor_rollout_wg = all_wg[str(Role.ActorRolloutRef)] self.actor_rollout_wg.init_model() + self.dp_world = max(1, self.actor_rollout_wg.world_size // self._dp_denom) + py_logger.info( + "DP world = %d (actor_world_size=%d, TP=%d, ulysses_sp=%d)", + self.dp_world, + self.actor_rollout_wg.world_size, + tp, + sp, + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 97 - 109, The dp_world computation in _opsd_update currently uses trainer-wide GPU counts (total_gpus) which can be larger than the role-specific actor pool; after actor_rollout_wg is created, recompute dp_world from that group's actual world size (e.g., use actor_rollout_wg.size or actor_rollout_wg.world_size) and divide by (tensor_model_parallel_size * ulysses_sequence_parallel_size) instead of using self.config.trainer.n_gpus_per_node * nnodes; update the code in _opsd_update to set self.dp_world based on the actor_rollout_wg world size and keep the same denom logic (tp and sp) so padding/divisor uses the real actor worker group size.

coderabbitai · 2026-05-30T05:34:49Z

+    logger.info(
+        "OPSD batch[0] sample preview (max_length=%d):\n"
+        "  teacher_prompt (sd_prompt) messages: %s\n"
+        "  student_prompt (sft_prompt) messages: %s\n"
+        "  response (%d chars): %s",
+        max_length, t_msgs, s_msgs, len(response_text), response_preview,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid logging raw prompt/response content at INFO level.

This emits decoded prompts and response text into routine training logs, which is a privacy/compliance risk and can create noisy, high-volume logs. Gate this behind DEBUG and truncate/sanitize prompt payloads too.

Suggested hardening

def _log_first_opsd_pair( t_prompt: str, s_prompt: str, response_text: str, max_length: int ) -> None: @@ - response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "") - logger.info( + t_preview = str(t_msgs) + s_preview = str(s_msgs) + t_preview = t_preview[:500] + ("…" if len(t_preview) > 500 else "") + s_preview = s_preview[:500] + ("…" if len(s_preview) > 500 else "") + response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "") + logger.debug( "OPSD batch[0] sample preview (max_length=%d):\n" " teacher_prompt (sd_prompt) messages: %s\n" " student_prompt (sft_prompt) messages: %s\n" " response (%d chars): %s", - max_length, t_msgs, s_msgs, len(response_text), response_preview, + max_length, t_preview, s_preview, len(response_text), response_preview, )

- log_first_pair = logger.isEnabledFor(logging.INFO) + log_first_pair = logger.isEnabledFor(logging.DEBUG)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 335 - 340, The current logger.info call in sd_verifier.py is emitting raw prompt and response content (logger.info(..., max_length, t_msgs, s_msgs, len(response_text), response_preview)), which is a privacy/compliance risk and too noisy; change this to logger.debug and sanitize/truncate the payloads before logging (e.g., replace or mask sensitive tokens and limit t_msgs, s_msgs, and response_preview to a fixed small length like 200 chars or a summarised count), or alternatively log only metadata (max_length and response length) at INFO and the sanitized/truncated content at DEBUG; update the call site that references logger.info, max_length, t_msgs, s_msgs, response_text, and response_preview accordingly.

coderabbitai · 2026-05-30T05:34:49Z

+    for idx, (t_prompt, s_prompt, response_text) in enumerate(
+        zip(teacher_prompts, student_prompts, responses)
+    ):


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the target file and inspect around the referenced lines ls -la workspace/src/self_distill_hybrid || true sed -n '330,430p' workspace/src/self_distill_hybrid/sd_verifier.py # Find all occurrences of the zip call (to see if strict=True or other guards exist elsewhere) rg -n "zip$teacher_prompts, student_prompts, responses$" workspace/src/self_distill_hybrid/sd_verifier.py # Look for any function that builds/assigns teacher_prompts/student_prompts/responses rg -n "teacher_prompts|student_prompts|responses" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 5270

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find all call sites of build_opsd_batch rg -n "build_opsd_batch\(" -S . | head -n 200 # Inspect sd_verifier.py around the function definition (broader context) sed -n '320,430p' workspace/src/self_distill_hybrid/sd_verifier.py # Inspect potential surrounding code that prepares these lists in sd_verifier.py # (look for code that defines/assigns teacher_prompts/student_prompts/responses near where build_opsd_batch is called) rg -n "teacher_prompts\s*=|student_prompts\s*=|responses\s*=" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200 # For each build_opsd_batch call site, print a small window around it # (rg output is bounded; we then sed based on file and approximate line) rg -n "build_opsd_batch\(" -S . | cut -d: -f1,2 | while read -r f l; do echo "---- $f:$l ----" sed -n "$((l-20)),$((l+60))p" "$f" done

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4155

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '560,720p' workspace/src/self_distill_hybrid/opsd_trainer.py sed -n '760,880p' workspace/src/self_distill_hybrid/opsd_trainer.py # Also locate the exact call to build_opsd_batch and print a tight window rg -n "opsd_batch\s*=\s*build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py rg -n "build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 13185

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find where non_tensor_batch keys "sd_prompt" and "sft_prompt" are set/produced rg -n "\"sd_prompt\"|\"sft_prompt\"" workspace/src/self_distill_hybrid | head -n 200 # Inspect sd_dataset.py for building those fields sed -n '1,260p' workspace/src/self_distill_hybrid/sd_dataset.py sed -n '260,520p' workspace/src/self_distill_hybrid/sd_dataset.py # Check for any DataProto/collate logic that enforces list length consistency rg -n "non_tensor_batch" workspace/src/self_distill_hybrid | head -n 200

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 11309

🏁 Script executed:

#!/bin/bash set -euo pipefail # Look for Python version spec in common project files ls -la find . -maxdepth 3 -type f $ -iname "pyproject.toml" -o -iname "setup.cfg" -o -iname "setup.py" -o -iname "requirements*.txt" -o -iname "Pipfile" -o -iname "environment*.yml" -o -iname ".python-version" $ | sed 's|^\./||' # Print relevant excerpts if present for f in pyproject.toml setup.cfg setup.py .python-version requirements.txt requirements-dev.txt environment.yml environment.yaml Pipfile; do if [ -f "$f" ]; then echo "---- $f ----" sed -n '1,200p' "$f" fi done # If pyproject exists, extract the python version constraint lines if [ -f "pyproject.toml" ]; then rg -n "python\s*=" pyproject.toml || true rg -n "requires-python" pyproject.toml || true rg -n "Programming Language :: Python" -S pyproject.toml setup.cfg setup.py || true fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 647

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,200p' workspace/requirements.txt sed -n '1,200p' README.md # Any explicit python version files find . -maxdepth 2 -type f -iname "*python*" -o -iname ".python-version" -o -iname "*runtime*" | sed 's|^\./||' if [ -f .python-version ]; then cat .python-version; fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 8163

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "zip\([^)]*,\s*strict\s*=\s*True" -S workspace/src/self_distill_hybrid || true rg -n "zip\(" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 268

Do not let zip silently drop mismatched inputs.

build_opsd_batch() iterates with zip(teacher_prompts, student_prompts, responses); if any upstream contract breaks and the lists differ in length, zip truncates silently and can misalign prompts and responses. Add a length guard (and/or strict=True; repo targets Python 3.10+).

Suggested fix

+ if not (len(teacher_prompts) == len(student_prompts) == len(responses)): + raise ValueError( + "build_opsd_batch expects equal lengths for " + "teacher_prompts, student_prompts, and responses" + ) for idx, (t_prompt, s_prompt, response_text) in enumerate( - zip(teacher_prompts, student_prompts, responses) + zip(teacher_prompts, student_prompts, responses, strict=True) ):

🧰 Tools

🪛 Ruff (0.15.14)

[warning] 385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 384 - 386, In build_opsd_batch(), the loop uses zip(teacher_prompts, student_prompts, responses) which silently truncates mismatched lengths; add an explicit length check comparing len(teacher_prompts), len(student_prompts), and len(responses) (or use zip(..., strict=True) since we target Python 3.10+) and raise a clear ValueError including the three lengths if they differ, so the iteration cannot proceed with misaligned inputs; update the for-loop (identified by the tuple unpacking for idx, (t_prompt, s_prompt, response_text) in enumerate(...)) to rely on the guarded lengths or strict zip.

Two small follow-ups uncovered while watching the LVA1 quick-test run of the verl 0.7.0.7 rebase. sd_verifier._log_first_opsd_pair: emit at WARNING and drop the logger.isEnabledFor(logging.INFO) gate. Hydra raises the __main__ logger to INFO but leaves the root logger at the WARNING default, so child loggers like self_distill_hybrid.sd_verifier silently swallow INFO calls when running inside the Ray actor. Same logger's WARNING messages (e.g. the over-length skip report) DO surface, which is what tipped me off — the preview was supposed to fire at idx==0 of every build_opsd_batch call but never showed up in the cluster log. Use logger.warning and drop the gate so it always emits exactly once per training step. process_eval_data.process_dapo: stop producing val_dapo.parquet. The downstream training pipeline only validates on AIME24, AIME25, and MATH-500 (the auto-detect in train_opsd.sh never picked up val_dapo). The previous 5% train/val split was therefore burning 870 examples that no consumer ever read. Return the full DAPO set as the training split; skip the val output and the no-longer- reachable summary lines. Module docstring, section banner, and process_dapo docstring updated to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/data/process_eval_data.py`:
- Around line 40-45: The README still describes a 95/5 DAPO preprocessing split
and production of val_dapo.parquet but the code in process_dapo now keeps all
DAPO examples in a single training split; update the README.md documentation to
reflect this new behavior by replacing the 95/5 split description and any
references to val_dapo.parquet with text that states DAPO is processed as a
single training split (no validation output produced) and adjust any example
commands/outputs that mention val_dapo.parquet accordingly; verify and update
the README sections that previously referenced the split and the val_dapo output
so they match the process_dapo implementation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b64964a0-9a48-43a1-a164-8692647becc7

📥 Commits

Reviewing files that changed from the base of the PR and between 0a3867a and 99071c9.

📒 Files selected for processing (2)

workspace/src/data/process_eval_data.py
workspace/src/self_distill_hybrid/sd_verifier.py

🚧 Files skipped from review as they are similar to previous changes (1)

workspace/src/self_distill_hybrid/sd_verifier.py

coderabbitai · 2026-05-30T06:06:50Z

+def process_dapo(data_path):
+    """Process DAPO-Math-17k-dedup as a single training split.
+
+    The training pipeline validates on AIME24/AIME25/MATH-500 only, so the
+    previous train/val split was producing a val_dapo.parquet that no
+    downstream code consumed. Keep all examples in training instead.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update README preprocessing contract to match the new single-split behavior.

This change keeps all DAPO records in training, but README.md still documents a 95/5 DAPO split and val_dapo.parquet output (README.md Line 89 and Line 99-102). Please update docs in this PR to prevent operator confusion.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/data/process_eval_data.py` around lines 40 - 45, The README still describes a 95/5 DAPO preprocessing split and production of val_dapo.parquet but the code in process_dapo now keeps all DAPO examples in a single training split; update the README.md documentation to reflect this new behavior by replacing the 95/5 split description and any references to val_dapo.parquet with text that states DAPO is processed as a single training split (no validation output produced) and adjust any example commands/outputs that mention val_dapo.parquet accordingly; verify and update the README sections that previously referenced the split and the val_dapo output so they match the process_dapo implementation.

The previous update_teacher wrapped the param copy in nested FSDP.summon_full_params(...), which materialises the FULL unsharded parameters of BOTH actor and ref on every rank for the duration of the copy. For Qwen3-14B that's ~30 GiB x 2 modules = ~60 GiB on top of the sharded params already on-GPU. When teacher_update_freq > 0 fires right after a training step has loaded the actor for fwd/bwd, this puts the H100 80 GB rank over the line and the rank OOMs with: ray::WorkerDict.actor_rollout_ref_update_teacher() torch.OutOfMemoryError: Tried to allocate 5.80 GiB. GPU 0 has a total capacity of 79.19 GiB of which 2.28 GiB is free. Observed in tu1 (teacher_update_freq=1) right after val_before_train + first opsd_update on LVA1 (Flyte fcbed452f8d7640709d2 -> retry f9941301d258e4619921). tu10/20/50/100 would have hit the same wall on their first teacher copy. The actor and ref modules are wrapped with the SAME FSDP policy on the SAME DP mesh during init in verl's ActorRolloutRefWorker, so their local parameter shards correspond element-wise. Iterating .parameters() on both in lock-step and copying p_student.data into p_teacher.data does the right thing without any unsharded materialisation. Add a shape-equal assert to make any future wrap-policy drift loud. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Qwen3-14B + Ulysses-SP=4 on H100 80 GB peaks at ~72-74 GiB during the reverse-KL / JSD path inside _opsd_training_step, leaving little to no slack on top of sglang's ~2.87 GiB rollout footprint. With the default best-fit allocator we hit: torch.kl_div(input, target, ...) torch.OutOfMemoryError: Tried to allocate 150.00 MiB. GPU 0 has total 79.19 GiB of which 130.69 MiB is free. PyTorch allocated 72.16 GiB, 120 MiB reserved-unallocated. (observed on the LVA1 quick test fcbed452f8d7640709d2 step 3 and the tu20 retry fa063558d3d234157b4c step 1). 120 MiB reserved-unallocated + < 150 MiB free + 150 MiB requested is the textbook fragmentation signature; expandable_segments coalesces those holes and almost always gives a few GB back. PyTorch's own OOM message points at this setting as the first thing to try. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

workspace/src/self_distill_hybrid/opsd_worker.py (1)

295-301: ⚡ Quick win

Validate response-token counts per sample, not only in aggregate.

This assert only compares flattened totals. A malformed batch can still pass with equal totals while pairing the wrong teacher/student logits across examples. Since the contract is per-pair alignment, check the shifted loss_mask counts row-by-row before flattening.

Proposed refactor

+            t_resp_counts = t_loss_mask[:, 1:].sum(dim=1)
+            s_resp_counts = s_loss_mask[:, 1:].sum(dim=1)
+            if not torch.equal(t_resp_counts, s_resp_counts):
+                raise RuntimeError(
+                    "teacher/student response-token counts differ per sample: "
+                    f"teacher={t_resp_counts.tolist()} student={s_resp_counts.tolist()}"
+                )
             assert teacher_logits.shape[0] == student_logits.shape[0], (
                 f"teacher/student response-token count mismatch: "
                 f"{teacher_logits.shape[0]} vs {student_logits.shape[0]}. "

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 295 - 301, The
current assert only checks flattened totals (teacher_logits.shape[0] vs
student_logits.shape[0]) which can hide per-example misalignment; compute
per-sample response-token counts from the shifted loss_mask (e.g., resp_counts =
(loss_mask[:, 1:] != 0).sum(dim=1)) and use those counts to (a) assert
resp_counts.sum() equals teacher_logits.shape[0] and student_logits.shape[0],
and (b) verify that splitting/offsetting teacher_logits and student_logits into
per-sample chunks using resp_counts yields matching lengths for each pair;
update the code around teacher_logits/student_logits handling in opsd_worker.py
(and ensure this invariant complements upstream filter_overlong_prompts and
build_opsd_batch).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 94-105: The loop in update_teacher uses
zip(self.actor_module_fsdp.parameters(), self.ref_module_fsdp.parameters())
which silently truncates if the two parameter iterators differ; change the loop
to use zip(..., strict=True) so it raises immediately on length mismatch,
preserving the existing per-parameter shape check and RuntimeError behavior
(ensure the runtime environment targets Python 3.10+ as noted in README.md).

---

Nitpick comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 295-301: The current assert only checks flattened totals
(teacher_logits.shape[0] vs student_logits.shape[0]) which can hide per-example
misalignment; compute per-sample response-token counts from the shifted
loss_mask (e.g., resp_counts = (loss_mask[:, 1:] != 0).sum(dim=1)) and use those
counts to (a) assert resp_counts.sum() equals teacher_logits.shape[0] and
student_logits.shape[0], and (b) verify that splitting/offsetting teacher_logits
and student_logits into per-sample chunks using resp_counts yields matching
lengths for each pair; update the code around teacher_logits/student_logits
handling in opsd_worker.py (and ensure this invariant complements upstream
filter_overlong_prompts and build_opsd_batch).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29287c03-b724-4118-9970-1cb73a5f78aa

📥 Commits

Reviewing files that changed from the base of the PR and between 99071c9 and 4d50e6a.

📒 Files selected for processing (1)

workspace/src/self_distill_hybrid/opsd_worker.py

coderabbitai · 2026-05-30T07:23:27Z

+            for p_student, p_teacher in zip(
+                self.actor_module_fsdp.parameters(),
+                self.ref_module_fsdp.parameters(),
+            ):
+                if p_student.shape != p_teacher.shape:
+                    raise RuntimeError(
+                        f"update_teacher shard mismatch at param {n_params}: "
+                        f"actor shard {tuple(p_student.shape)} vs ref shard "
+                        f"{tuple(p_teacher.shape)}. Both FSDP modules must "
+                        "be wrapped with the same policy on the same mesh."
+                    )
+                p_teacher.data.copy_(p_student.data)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' import sys print(f"sandbox_python={sys.version.split()[0]}") PY fd -HI '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' | while read -r f; do printf '\n== %s ==\n' "$f" rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:' "$f" || true done

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 100

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate likely Python version declaration files files=$(fd -t f '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' . || true) echo "Found python-config files:" echo "$files" | sed 's|^| - |' # Extract python version constraints from those files (and a few common extras) echo echo "Python version constraints (if present):" for f in $files; do echo echo "== $f ==" sed -n '1,200p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:|^FROM python:|target-version|language-version|python_version' || true sed -n '1,200p' "$f" | rg -n '\[tool\.poetry\.dependencies\]|\[project\]|\[tool\.poetry\]' || true done # Confirm context for the teacher sync code path echo echo "Context around opsd_worker.py teacher update parameters copy loop:" fd -t f 'opsd_worker\.py$' workspace/src/self_distill_hybrid/opsd_worker.py >/dev/null 2>&1 && true python_file="workspace/src/self_distill_hybrid/opsd_worker.py" if [ -f "$python_file" ]; then rg -n 'update_teacher|teacher shard mismatch|actor_module_fsdp|ref_module_fsdp|p_teacher\.data\.copy_' "$python_file" -n -S echo # show surrounding lines (avoid dumping too much) sed -n '60,140p' "$python_file" | nl -ba | rg -n 'for p_student, p_teacher|teacher shard mismatch|copy_\(p_student|parameters\(' || true else echo "Missing: $python_file" fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 2746

🏁 Script executed:

#!/bin/bash set -euo pipefail # Search for common Python version constraint files candidates=$( (fd -t f 'pyproject\.toml$' . || true) (fd -t f 'setup\.cfg$' . || true) (fd -t f 'setup\.py$' . || true) (fd -t f 'tox\.ini$' . || true) (fd -t f '\.python-version$' . || true) (fd -t f '^Dockerfile' . || true) ) echo "Constraint file candidates:" if [ -z "${candidates// }" ]; then echo " (none found)" else echo "$candidates" | sed 's/^/ - /' fi # Extract Python version constraints from any found candidates if [ -n "${candidates// }" ]; then echo echo "Extracted Python version constraints:" echo "$candidates" | while read -r f; do [ -f "$f" ] || continue echo echo "== $f ==" sed -n '1,220p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM python:|target-version|language-version|python_version' || true done fi # Also just print the signature of zip for sanity (sandbox runtime) python - <<'PY' import sys, inspect print("sandbox python:", sys.version) print("zip signature:", inspect.signature(zip)) PY

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 824

🏁 Script executed:

#!/bin/bash set -euo pipefail # List likely CI/workflow files echo "Root files (top-level):" ls -la | head -n 60 echo echo "Searching for Python version declarations in repo files..." # Common CI markers rg -n --hidden --glob '!.git/*' \ 'actions/setup-python|python-version:|setup-python|PYTHON_VERSION|requires-python|python_requires|FROM python:|language: python|target-version' \ . || true echo echo "Searching for any usage of zip(strict=...) in the repo..." rg -n --hidden --glob '!.git/*' 'zip\([^)]*strict\s*=' . || true echo echo "Searching for docs mentioning supported Python versions..." rg -n --hidden --glob '!.git/*' \ 'Python (3\.[0-9]+|>=|minimum|supported|compatibility)|python 3\.[0-9]+' \ README* docs* .github* . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 1005

🏁 Script executed:

#!/bin/bash set -euo pipefail python_file="workspace/src/self_distill_hybrid/opsd_worker.py" sed -n '80,130p' "$python_file" | nl -ba | sed -n '1,120p'

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 121

🏁 Script executed:

#!/bin/bash set -euo pipefail python_file="workspace/src/self_distill_hybrid/opsd_worker.py" echo "=== Exact zip loop block ===" # Print a tighter window around the zip() loop rg -n "for p_student, p_teacher in zip\(" "$python_file" start=$(rg -n "for p_student, p_teacher in zip\(" "$python_file" | head -n1 | cut -d: -f1) echo "Start line: $start" lo=$((start-5)) hi=$((start+30)) sed -n "${lo},${hi}p" "$python_file" | cat -n echo echo "=== Context lines 70-120 ===" sed -n '70,120p' "$python_file" | cat -n

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4927

Fail fast when FSDP parameter iterators diverge in update_teacher. zip() truncates to the shorter iterator; if actor_module_fsdp.parameters() and ref_module_fsdp.parameters() produce different lengths, trailing teacher shards won’t be copied (yet the update still logs success). Use zip(..., strict=True) (Python 3.10+ per README.md).

Proposed fix

for p_student, p_teacher in zip( self.actor_module_fsdp.parameters(), self.ref_module_fsdp.parameters(), + strict=True, ):

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for p_student, p_teacher in zip(

self.actor_module_fsdp.parameters(),

self.ref_module_fsdp.parameters(),

):

if p_student.shape != p_teacher.shape:

raise RuntimeError(

f"update_teacher shard mismatch at param {n_params}: "

f"actor shard {tuple(p_student.shape)} vs ref shard "

f"{tuple(p_teacher.shape)}. Both FSDP modules must "

"be wrapped with the same policy on the same mesh."

)

p_teacher.data.copy_(p_student.data)

for p_student, p_teacher in zip(

self.actor_module_fsdp.parameters(),

self.ref_module_fsdp.parameters(),

strict=True,

):

if p_student.shape != p_teacher.shape:

raise RuntimeError(

f"update_teacher shard mismatch at param {n_params}: "

f"actor shard {tuple(p_student.shape)} vs ref shard "

f"{tuple(p_teacher.shape)}. Both FSDP modules must "

"be wrapped with the same policy on the same mesh."

)

p_teacher.data.copy_(p_student.data)

🧰 Tools

🪛 Ruff (0.15.14)

[warning] 94-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 94 - 105, The loop in update_teacher uses zip(self.actor_module_fsdp.parameters(), self.ref_module_fsdp.parameters()) which silently truncates if the two parameter iterators differ; change the loop to use zip(..., strict=True) so it raises immediately on length mismatch, preserving the existing per-parameter shape check and RuntimeError behavior (ensure the runtime environment targets Python 3.10+ as noted in README.md).

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

workspace/scripts/sft/train_opsd.sh (1)

36-42: ⚡ Quick win

Don't clobber caller-provided allocator tuning.

This overwrites any existing PYTORCH_CUDA_ALLOC_CONF, so operators can't combine expandable_segments with other allocator knobs they already rely on.

Suggested change

-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+if [[ -n "${PYTORCH_CUDA_ALLOC_CONF:-}" ]]; then
+    case ",${PYTORCH_CUDA_ALLOC_CONF}," in
+        *,expandable_segments:True,*|*,expandable_segments:true,*) ;;
+        *) export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF},expandable_segments:True" ;;
+    esac
+else
+    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/train_opsd.sh` around lines 36 - 42, The script
currently overwrites PYTORCH_CUDA_ALLOC_CONF which destroys caller-provided
tuning; update the export in train_opsd.sh to preserve existing settings by
checking if PYTORCH_CUDA_ALLOC_CONF is set and, if so, append
",expandable_segments:True" (avoiding duplicates) otherwise set it to
"expandable_segments:True", and ensure the code references the
PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so
other allocator knobs remain intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 58-63: The script always enables MLflow in the HuggingFace Trainer
by setting trainer.logger='["console","mlflow"]' which causes local runs to
create mlruns/; change the logic in workspace/scripts/sft/train_opsd.sh around
the trainer.logger assignment so that "mlflow" is only added when a tracking
backend is configured (e.g. MLFLOW_TRACKING_URI is non-empty or
FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that
detects those env vars and sets trainer.logger to '["console","mlflow"]' when
true and '["console"]' otherwise, referencing the trainer.logger assignment and
the sourced setup_mlflow_hf.sh for placement.

---

Nitpick comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 36-42: The script currently overwrites PYTORCH_CUDA_ALLOC_CONF
which destroys caller-provided tuning; update the export in train_opsd.sh to
preserve existing settings by checking if PYTORCH_CUDA_ALLOC_CONF is set and, if
so, append ",expandable_segments:True" (avoiding duplicates) otherwise set it to
"expandable_segments:True", and ensure the code references the
PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so
other allocator knobs remain intact.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 498686f6-65e2-4738-afca-1c04762d81b6

📥 Commits

Reviewing files that changed from the base of the PR and between 4d50e6a and 65f827f.

📒 Files selected for processing (1)

workspace/scripts/sft/train_opsd.sh

coderabbitai · 2026-05-30T07:31:09Z

+# Auto-configure MLflow tracking when running on the Flyte/Kingkong cluster.
+# Sets MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT from
+# Flyte execution env vars so HuggingFace Trainer (and any other downstream
+# code that respects MLFLOW_*) logs to go/mlflowui automatically. Becomes a
+# no-op if FLYTE_INTERNAL_EXECUTION_PROJECT is unset (local invocations).
+source "${SCRIPT_DIR}/setup_mlflow_hf.sh"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In VERL 0.7.x, what happens when trainer.loggerincludes "mlflow" butMLFLOW_TRACKING_URIis unset? Does it no-op, create a localmlruns store, or fail initialization?

💡 Result:

In VERL 0.7.x, when trainer.logger includes "mlflow" but MLFLOW_TRACKING_URI is unset, the behavior is determined by the underlying MLflow Python library's default configuration rather than explicit handling within the VERL framework [1][2]. By default, when MLFLOW_TRACKING_URI is not set, MLflow initializes a local backend, typically creating a directory named mlruns in the current working directory to store tracking data [1]. VERL’s tracking.py module integrates MLflow by importing it and calling mlflow.log_metrics directly within its logging methods [2]. Because VERL does not override the default MLflow tracking URI, the system will fall back to this default local behavior [1][2]. It does not fail initialization due to the missing environment variable; instead, it proceeds to log locally [1][2]. If you wish to use a specific local path or a remote server, you must explicitly set the MLFLOW_TRACKING_URI environment variable before the trainer initializes, as documented in the VERL guidelines [1].

Citations:

1: https://verl.readthedocs.io/en/latest/advance/rollout_trace.html

2: https://verl.readthedocs.io/en/latest/_modules/verl/utils/tracking.html

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="workspace/scripts/sft/train_opsd.sh" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" exit 1 fi echo "---- head around 50-90 ----" nl -ba "$FILE" | sed -n '45,90p' echo "---- head around 260-320 ----" nl -ba "$FILE" | sed -n '260,320p' # Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR # Determine SCRIPT_DIR value? Just search for the file name in repo. echo "---- find setup_mlflow_hf.sh ----" fd -a "setup_mlflow_hf.sh" . || true # Print that script if small enough; otherwise print relevant parts. SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)" if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then echo "---- file: $SETUP (first 200 lines) ----" wc -l "$SETUP" nl -ba "$SETUP" | sed -n '1,220p' fi # Search for FLYTE_INTERNAL_EXECUTION_PROJECT usage in setup or train script echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT ----" rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true # Search for trainer.logger usage in train_opsd.sh echo "---- rg trainer\.logger ----" rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="workspace/scripts/sft/train_opsd.sh" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" exit 1 fi echo "---- head around 50-90 ----" nl -ba "$FILE" | sed -n '45,90p' echo "---- head around 260-320 ----" nl -ba "$FILE" | sed -n '260,320p' # Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR echo "---- find setup_mlflow_hf.sh ----" SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)" if [ -n "${SETUP:-}" ]; then echo "Found: $SETUP" wc -l "$SETUP" echo "---- file: $SETUP (first 220 lines) ----" nl -ba "$SETUP" | sed -n '1,220p' else echo "setup_mlflow_hf.sh not found" fi echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----" rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true echo "---- rg trainer\.logger in train_opsd.sh ----" rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="workspace/scripts/sft/train_opsd.sh" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" exit 1 fi echo "---- head around 50-90 ----" nl -ba "$FILE" | sed -n '45,90p' echo "---- head around 260-320 ----" nl -ba "$FILE" | sed -n '260,320p' echo "---- find setup_mlflow_hf.sh ----" SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)" if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then echo "Found: $SETUP" wc -l "$SETUP" echo "---- file: $SETUP (first 220 lines) ----" nl -ba "$SETUP" | sed -n '1,220p' else echo "setup_mlflow_hf.sh not found" fi echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----" rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true echo "---- rg trainer\.logger in train_opsd.sh ----" rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="workspace/scripts/sft/train_opsd.sh" echo "== file exists? ==" ls -la "$FILE" || true echo "== print lines 45-90 of $FILE (with numbers) ==" python3 - <<'PY' from pathlib import Path path = Path("workspace/scripts/sft/train_opsd.sh") lines = path.read_text().splitlines() for i in range(44, 90): # 1-based -> slice index if i < len(lines): print(f"{i+1}\t{lines[i]}") PY echo "== print lines 260-320 of $FILE (with numbers) ==" python3 - <<'PY' from pathlib import Path path = Path("workspace/scripts/sft/train_opsd.sh") lines = path.read_text().splitlines() for i in range(259, 320): if i < len(lines): print(f"{i+1}\t{lines[i]}") PY echo "== find setup_mlflow_hf.sh ==" SETUP_PATHS="$(find . -type f -name 'setup_mlflow_hf.sh' 2>/dev/null || true)" echo "$SETUP_PATHS" | head -n 20 SETUP_PATH="$(echo "$SETUP_PATHS" | head -n 1)" if [ -n "${SETUP_PATH:-}" ]; then echo "== setup file line count ==" python3 - <<PY from pathlib import Path p=Path("$SETUP_PATH") print(sum(1 for _ in p.open('r',encoding='utf-8',errors='ignore'))) PY echo "== print lines 1-220 of $SETUP_PATH (with numbers) ==" python3 - <<'PY' from pathlib import Path path = Path("PLACEHOLDER") PY fi if [ -n "${SETUP_PATH:-}" ]; then python3 - <<PY from pathlib import Path path = Path("$SETUP_PATH") lines = path.read_text().splitlines() for i in range(0, min(220, len(lines))): print(f"{i+1}\t{lines[i]}") PY fi echo "== search for FLYTE_INTERNAL_EXECUTION_PROJECT ==" if command -v rg >/dev/null 2>&1; then rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true else grep -RIn "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true fi echo "== search for trainer.logger ==" if command -v rg >/dev/null 2>&1; then rg -n "trainer\.logger" "$FILE" || true else grep -n "trainer\.logger" "$FILE" || true fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4852

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== ls workspace/scripts/sft ==" ls -la workspace/scripts/sft || true echo "== find setup_mlflow_hf.sh (case-insensitive) ==" find workspace/scripts/sft -maxdepth 1 -iname 'setup_mlflow_hf.sh' -print || true echo "== find setup_mlflow_hf.sh anywhere (case-insensitive, limited to first 20) ==" find . -type f -iname 'setup_mlflow_hf.sh' 2>/dev/null | head -n 20 || true # If the expected file exists, print it if [ -f "workspace/scripts/sft/setup_mlflow_hf.sh" ]; then echo "== setup_mlflow_hf.sh exists; line count ==" python3 - <<'PY' from pathlib import Path p=Path("workspace/scripts/sft/setup_mlflow_hf.sh") print(sum(1 for _ in p.open("r",encoding="utf-8",errors="ignore"))) PY echo "== setup_mlflow_hf.sh (first 200 lines with numbers) ==" python3 - <<'PY' from pathlib import Path path=Path("workspace/scripts/sft/setup_mlflow_hf.sh") lines=path.read_text().splitlines() for i in range(0, min(200, len(lines))): print(f"{i+1}\t{lines[i]}") PY fi echo "== rg for setup_mlflow_hf.sh references ==" rg -n "setup_mlflow_hf\.sh" . || true echo "== rg for FLYTE_INTERNAL_EXECUTION_PROJECT across repo ==" rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true echo "== rg for mlflow logger handling in repo (best-effort) ==" rg -n "mlflow" workspace/scripts/sft . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 1474

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== find any setup_mlflow* scripts ==" find . -type f $ -iname 'setup_mlflow*' -o -iname '*mlflow*setup*' $ | head -n 50 echo "== find mlflow setup scripts by name fragments ==" find . -type f $ -iname '*mlflow*.sh' -o -iname '*mlflow*setup*' $ | head -n 50 echo "== rg for MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT ==" rg -n "MLFLOW_TRACKING_URI|MLFLOW_EXPERIMENT_NAME|MLFLOW_RUN_CONTEXT" . || true echo "== rg for go/mlflowui ==" rg -n "go/mlflowui" . || true echo "== rg for mlruns directory creation ==" rg -n "mlruns" . || true echo "== rg for trainer.logger usage ==" rg -n "trainer\.logger" . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 777

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="workspace/src/self_distill_hybrid/opsd_trainer.py" python3 - <<'PY' from pathlib import Path path=Path("workspace/src/self_distill_hybrid/opsd_trainer.py") lines=path.read_text().splitlines() start=540; end=650 # around 581 for i in range(start-1, min(end, len(lines))): print(f"{i+1}\t{lines[i]}") PY echo "== rg for default_backend/self.config.trainer.logger usage ==" rg -n "default_backend=self\.config\.trainer\.logger|trainer\.logger|logger" workspace/src/self_distill_hybrid/opsd_trainer.py || true echo "== rg for mlflow usage in self_distill_hybrid ==" rg -n "mlflow" workspace/src/self_distill_hybrid || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 7474

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== locate verl/utils/tracking.py in repo ==" find . -type f -path '*verl*utils*tracking.py' -maxdepth 8 2>/dev/null || true find . -type f -name 'tracking.py' | head -n 50 echo "== rg for 'class Tracking' ==" rg -n "class Tracking" . || true echo "== rg for 'default_backend' ==" rg -n "default_backend" . || true echo "== rg for 'mlflow' inside Tracking implementation paths (best-effort) ==" rg -n "mlflow" . -g'*.py' | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 368

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' import inspect, os, sys try: import verl.utils.tracking as tracking except Exception as e: print("Failed to import verl.utils.tracking:", e) sys.exit(0) print("Tracking module file:", tracking.__file__) src = inspect.getsource(tracking) print("---- tracking.py snippet around Tracking class ----") # Print only relevant fragments to keep output small for needle in ["class Tracking", "mlflow", "MLFLOW_TRACKING_URI", "default_backend", "tracking_uri", "set_tracking_uri"]: idx = src.find(needle) print(needle, "found" if idx!=-1 else "NOT found", "at", idx) print("---- end ----") PY python3 - <<'PY' import re, inspect import verl.utils.tracking as tracking src = inspect.getsource(tracking) # Extract likely mlflow-related functions/logic patterns = [ r"def .*mlflow.*?:.*?(?=\\n\\ndef|\\Z)", r"class Tracking.*?:.*?(?=\\n\\n)", ] print("---- quick grep-like lines containing mlflow/uri ----") for line in src.splitlines(): if "mlflow" in line.lower() or "tracking_uri" in line.lower() or "MLFLOW_" in line: print(line) PY

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 274

Gate the MLflow logger so local runs stay console-only.

workspace/scripts/sft/train_opsd.sh always sets trainer.logger='["console", "mlflow"]' (line 285). Even if setup_mlflow_hf.sh is a local no-op, VERL routes tracking to MLflow whenever "mlflow" is present; with MLFLOW_TRACKING_URI unset, MLflow falls back to its local backend (creating mlruns/), so local invocations won’t be console-only.

Suggested change

+TRAINER_LOGGER='["console"]' +if [[ -n "${FLYTE_INTERNAL_EXECUTION_PROJECT:-}" ]]; then + TRAINER_LOGGER='["console", "mlflow"]' +fi + python3 -m self_distill_hybrid.main_opsd \ @@ - trainer.logger='["console", "mlflow"]' \ + trainer.logger="${TRAINER_LOGGER}" \

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 63-63: Not following: ./setup_mlflow_hf.sh was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/scripts/sft/train_opsd.sh` around lines 58 - 63, The script always enables MLflow in the HuggingFace Trainer by setting trainer.logger='["console","mlflow"]' which causes local runs to create mlruns/; change the logic in workspace/scripts/sft/train_opsd.sh around the trainer.logger assignment so that "mlflow" is only added when a tracking backend is configured (e.g. MLFLOW_TRACKING_URI is non-empty or FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that detects those env vars and sets trainer.logger to '["console","mlflow"]' when true and '["console"]' otherwise, referencing the trainer.logger assignment and the sourced setup_mlflow_hf.sh for placement.

Commit 65f827f set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to relieve the reverse-KL / JSD path fragmentation pressure inside _opsd_training_step. But sglang's torch_memory_saver hard-asserts against this allocator in _ensure_initialized: RuntimeError: TorchMemorySaver is disabled for the current process because expandable_segments is not supported yet. triggered from sglang.srt.model_executor.model_runner.load_model -> torch_memory_saver/entrypoint.py:199. This crashes the SGLang rollout during model loading, which kills the Master replica and fails the PyTorchJob with an opaque CalledProcessError. Observed in tu50/tu100 retries (Flyte f486fd4845df14f5b930 / f782dcc0ede2e4c1babf), which were the only resubmissions whose mldev tarball was uploaded after 65f827f. tu1/tu10/tu20 retries uploaded before 65f827f and proceeded normally. Leave a comment in place so future-me doesn't re-add this. The underlying reverse-KL OOM will be addressed in opsd_worker.py via smaller chunk_size in _compute_*_loss_liger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Both tu10 (f2e8314b96b034161b5b) and tu20 (f6c97d63f75ee4383837) OOMed in actor_rollout_ref_update_opsd with peaks ~72 GiB on H100 80 GB. The hot allocations are the per-chunk fp32 intermediates inside the JSD / reverse-KL paths in opsd_worker._compute_*_loss[_liger]: - log_softmax, exp, mixture, kl_t, kl_s — up to ~6 buffers of (chunk_size, V=151,936) fp32 each - at chunk_size=256 that's 6 * 256 * 152K * 4 = ~935 MiB per transient buffer; the autograd graph retains the student-side versions until backward, compounding pressure Cut all chunk_size defaults to 1/4: - standard JSD / reverse-KL: 512 -> 128 - liger JSD / reverse-KL: 256 -> 64 - standard entropy: 512 -> 128 - liger entropy: 256 -> 64 This shrinks each transient ~234 MiB (chunk_size=128, fp32) or ~117 MiB (chunk_size=64), giving ~3-4 GiB peak-memory relief in the JSD path. Loss math is unchanged — chunking is purely an algorithmic device. Per the goal's OOM directive, also halve train_batch_size 8 -> 4 in qwen3-14b-opsd-length-prune-{tu10,tu20}.json. With dp_world=2 the padding path stays clean (4 % 2 == 0). Effective compute per step is the same (micro_batch_size=1 means each step still backprops one sample at a time); only the gradient-accumulation count changes from 4 to 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

Hejian Sang and others added 2 commits May 30, 2026 07:15

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

Hejian Sang and others added 2 commits May 30, 2026 07:52

Conversation

HJSang commented May 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

HJSang commented May 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 30, 2026 •

edited

Loading