OPSD: rebase self_distill_hybrid on verl 0.7.0.7 API#11
Conversation
Bring the OPSD trainer/worker, verifier, dataset config, and the SFT launch scripts up to date with the verl 0.7.0.7 hybrid-engine surface that ships in the cluster image. Trainer / worker (workspace/src/self_distill_hybrid) - Wire CheckpointEngineManager: construct it in OPSDTrainer.init_workers, push trainer weights to the sglang replicas at fit() start, sleep replicas after every gen, and re-sync after each update_opsd so the on-policy invariant actually holds. Without this the rollout was pinned to the initial checkpoint for the whole run. - Rebase OPSDWorker directly on AsyncActorRolloutRefWorker; delete the now-unreferenced SelfDistillWorker base class and its update_sft path. - update_teacher: switch from make_nd_compute_dataproto_dispatch_fn to Dispatch.ONE_TO_ALL so it fires once per FSDP rank with no dummy data. - _opsd_update: pad train batches to actual DP world (total_gpus / (TP * Ulysses-SP)), not total GPUs, to stop the TP*SP-fold over-padding under the current TP=2 SP=4 config. - Drop _compute_val_loss / _build_val_data and the sd_val_files plumbing. Generation-based _validate over data.val_files + a reward fn is now the sole val signal. - _validate: drop the spurious sleep_replicas() that left rollout asleep when the function returned, breaking the next iter's gen. - AgentLoopManager.create: pass actor_rollout_resource_pool to match the verl PPO trainer's call site. - main_opsd: build val_reward_fn by directly instantiating the legacy verl.workers.reward_manager.naive.NaiveRewardManager (the one verl 0.7.x's load_reward_manager returns is the experimental async-only manager and is not __call__-able, which crashed _validate at the first val step). Verifier (sd_verifier.py) - _tokenize_sequence now refuses any pair whose teacher OR student sequence would exceed max_length instead of silently truncating; the truncation case misaligned teacher/student response logits per-sample for the whole batch tail. Defensive assert in _opsd_training_step catches any future violation. - Skip-counter warning surfaces the over-length skip ratio + max_length so misconfigurations are loud. - Add _log_first_opsd_pair for a single readable INFO-level preview of the first (teacher, student, response) triple per batch. Yaml (config/opsd_trainer.yaml) - Drop the dead top-level custom_reward_function block; reward fn lives under the inherited reward.custom_reward_function and is overridden there from train_opsd.sh. - Drop sd_val_files / opsd.val_max_samples (val-loss machinery gone). Scripts (workspace/scripts/sft) - setup_sft.sh: stop uninstalling and re-installing verl from a local vendored tree; the image ships the matching verl 0.7.0.7+deec5d0.cu130 build. Just pip-install the two extras (math-verify, tensordict). - train_opsd.sh: drop SD_VAL_PROMPTS_PATH, the VERL_ROOT PYTHONPATH prefix, and the broken top-level custom_reward_function.path override (it never reached load_reward_manager). Wire the dual-path math_verify scorer via reward.custom_reward_function.path + .name. Default REWARD_FN_PATH to workspace/src/rewards/dual_path_math_verify.py. Add TOTAL_TRAINING_STEPS hard cap. Tighten with -eo pipefail. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMigrate OPSD to VERL 0.7.x hybrid rollouts: use the image's verl stack and install extras, tighten train script and YAML reward config, change OPSDWorker to AsyncActorRolloutRefWorker, refactor OPSDTrainer to checkpoint-driven replica sync, and switch validation to generation/reward-based flow with stricter tokenization. ChangesOPSD VERL 0.7.x Migration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Run on LVA1 with TP=2 SP=4 N_GPUS=8 crashed in `_opsd_update` with:
AssertionError: only support equal chunk.
Got size of DataProto 15 and chunk 2.
after the verifier correctly dropped 1 over-length sample (16 → 15).
The padding target was wrong: I had `dp_world = total_gpus / (TP * SP)`
which gives 1 for the failing config, so the `n_samples % dp_world != 0`
gate never fired and the un-padded 15-row batch reached the dispatcher.
But the dispatcher's `make_nd_compute_dataproto_dispatch_fn(mesh_name="actor")`
slices along the actor mesh's DP axis, and the actor mesh has dims
`(dp, ulysses_sp)` only — TP is rollout-only (sglang's
`tensor_model_parallel_size`) and does NOT collapse the actor's DP axis.
Correct formula: `dp_world = total_gpus / ulysses_sp`. For the failing
config that's 8/4 = 2, which matches the `chunks=2` the dispatcher
demanded; 15 then pads to 16 and the dispatcher splits cleanly into
two 8-row chunks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
workspace/src/self_distill_hybrid/opsd_worker.py (1)
79-85:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast if actor/ref parameter layouts diverge.
zip()truncates silently. If the new worker wiring ever produces different parameter sets or ordering between actor and ref, this will copy only a prefix and still report success, leaving the teacher partially stale.🐛 Suggested change
- for p_student, p_teacher in zip( - self.actor_module_fsdp.parameters(), - self.ref_module_fsdp.parameters(), - ): - p_teacher.data.copy_(p_student.data) + actor_params = dict(self.actor_module_fsdp.named_parameters()) + ref_params = dict(self.ref_module_fsdp.named_parameters()) + if actor_params.keys() != ref_params.keys(): + raise RuntimeError( + "Actor/ref parameter sets diverged; refusing partial teacher sync" + ) + for name, p_student in actor_params.items(): + ref_params[name].data.copy_(p_student.data)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 79 - 85, The current loop uses zip over self.actor_module_fsdp.parameters() and self.ref_module_fsdp.parameters(), which silently truncates if the parameter sequences differ; instead, inside the FSDP.summon_full_params blocks for actor_module_fsdp and ref_module_fsdp, materialize both parameter lists (e.g., list(self.actor_module_fsdp.parameters()) and list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally validate matching shapes/names) and raise a RuntimeError with clear details if they diverge, and only then proceed to copy each p_student.data to p_teacher.data using the matched lists (references: actor_module_fsdp, ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).workspace/src/self_distill_hybrid/opsd_trainer.py (1)
580-595:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep
val_before_trainon its own step.
self.global_stepsis incremented before the optional pre-train validation runs, soval_before_traingets logged at the same step as the first optimization step. That makes the charts and step-aligned comparisons ambiguous. Log the warm-up validation at step0, or move the increment to after each train batch.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 580 - 595, The pre-training validation is logged at the wrong step because self.global_steps is incremented before the optional val_before_train check; move the increment so that the validation (triggered by val_before_train and calling self._validate()) is logged at step 0: call self.checkpoint_manager.update_weights(), run the optional validation and logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is still 0, and only increment self.global_steps (self.global_steps += 1) after validation (or after the first training batch) so the warm-up validation is recorded on its own step.
🧹 Nitpick comments (2)
workspace/scripts/sft/setup_sft.sh (1)
17-17: ⚡ Quick winConstrain the extra installs to the validated stack.
This unconstrained install can still upgrade shared transitive packages in the image, so identical runs may pick up different
math-verify/tensordictbehavior over time. Please install these through the repo’s pinned constraints or explicit versions instead of floating to latest.♻️ Suggested change
-pip install math-verify tensordict +python3 -m pip install \ + -c "${REPO_ROOT}/workspace/requirements.txt" \ + math-verify \ + tensordict🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/scripts/sft/setup_sft.sh` at line 17, Replace the floating pip install in setup_sft.sh (the line "pip install math-verify tensordict") with an installation that uses the repository's pinned constraints or explicit versions; e.g. use "pip install -c <constraints-file> math-verify tensordict" or pin specific versions like "math-verify==<version> tensordict==<version>" so transitive dependency upgrades are prevented and installs are reproducible.workspace/scripts/sft/train_opsd.sh (1)
47-47: ⚡ Quick winAvoid adding an empty
PYTHONPATHentry.If
PYTHONPATHis empty here, this expands to${SD_SRC}:, which adds the current working directory to Python’s import path. That makes module resolution depend on where the script is launched from.♻️ Suggested change
-export PYTHONPATH="${SD_SRC}:${PYTHONPATH}" +export PYTHONPATH="${SD_SRC}${PYTHONPATH:+:${PYTHONPATH}}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/scripts/sft/train_opsd.sh` at line 47, The export currently unconditionally appends PYTHONPATH which can produce an empty trailing entry; update the export of PYTHONPATH (the line referencing SD_SRC and PYTHONPATH) so that you prepend SD_SRC but only append the existing PYTHONPATH when it is non-empty — do this by using a shell conditional or parameter expansion that adds the colon and PYTHONPATH only if PYTHONPATH is set, thereby avoiding adding an empty entry that would implicitly include the current working directory.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 97-109: The dp_world computation in _opsd_update currently uses
trainer-wide GPU counts (total_gpus) which can be larger than the role-specific
actor pool; after actor_rollout_wg is created, recompute dp_world from that
group's actual world size (e.g., use actor_rollout_wg.size or
actor_rollout_wg.world_size) and divide by (tensor_model_parallel_size *
ulysses_sequence_parallel_size) instead of using
self.config.trainer.n_gpus_per_node * nnodes; update the code in _opsd_update to
set self.dp_world based on the actor_rollout_wg world size and keep the same
denom logic (tp and sp) so padding/divisor uses the real actor worker group
size.
In `@workspace/src/self_distill_hybrid/sd_verifier.py`:
- Around line 335-340: The current logger.info call in sd_verifier.py is
emitting raw prompt and response content (logger.info(..., max_length, t_msgs,
s_msgs, len(response_text), response_preview)), which is a privacy/compliance
risk and too noisy; change this to logger.debug and sanitize/truncate the
payloads before logging (e.g., replace or mask sensitive tokens and limit
t_msgs, s_msgs, and response_preview to a fixed small length like 200 chars or a
summarised count), or alternatively log only metadata (max_length and response
length) at INFO and the sanitized/truncated content at DEBUG; update the call
site that references logger.info, max_length, t_msgs, s_msgs, response_text, and
response_preview accordingly.
- Around line 384-386: In build_opsd_batch(), the loop uses zip(teacher_prompts,
student_prompts, responses) which silently truncates mismatched lengths; add an
explicit length check comparing len(teacher_prompts), len(student_prompts), and
len(responses) (or use zip(..., strict=True) since we target Python 3.10+) and
raise a clear ValueError including the three lengths if they differ, so the
iteration cannot proceed with misaligned inputs; update the for-loop (identified
by the tuple unpacking for idx, (t_prompt, s_prompt, response_text) in
enumerate(...)) to rely on the guarded lengths or strict zip.
---
Outside diff comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 580-595: The pre-training validation is logged at the wrong step
because self.global_steps is incremented before the optional val_before_train
check; move the increment so that the validation (triggered by val_before_train
and calling self._validate()) is logged at step 0: call
self.checkpoint_manager.update_weights(), run the optional validation and
logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is
still 0, and only increment self.global_steps (self.global_steps += 1) after
validation (or after the first training batch) so the warm-up validation is
recorded on its own step.
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 79-85: The current loop uses zip over
self.actor_module_fsdp.parameters() and self.ref_module_fsdp.parameters(), which
silently truncates if the parameter sequences differ; instead, inside the
FSDP.summon_full_params blocks for actor_module_fsdp and ref_module_fsdp,
materialize both parameter lists (e.g.,
list(self.actor_module_fsdp.parameters()) and
list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally
validate matching shapes/names) and raise a RuntimeError with clear details if
they diverge, and only then proceed to copy each p_student.data to
p_teacher.data using the matched lists (references: actor_module_fsdp,
ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).
---
Nitpick comments:
In `@workspace/scripts/sft/setup_sft.sh`:
- Line 17: Replace the floating pip install in setup_sft.sh (the line "pip
install math-verify tensordict") with an installation that uses the repository's
pinned constraints or explicit versions; e.g. use "pip install -c
<constraints-file> math-verify tensordict" or pin specific versions like
"math-verify==<version> tensordict==<version>" so transitive dependency upgrades
are prevented and installs are reproducible.
In `@workspace/scripts/sft/train_opsd.sh`:
- Line 47: The export currently unconditionally appends PYTHONPATH which can
produce an empty trailing entry; update the export of PYTHONPATH (the line
referencing SD_SRC and PYTHONPATH) so that you prepend SD_SRC but only append
the existing PYTHONPATH when it is non-empty — do this by using a shell
conditional or parameter expansion that adds the colon and PYTHONPATH only if
PYTHONPATH is set, thereby avoiding adding an empty entry that would implicitly
include the current working directory.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 80dec8bf-93c0-4d5f-a023-102d957db186
📒 Files selected for processing (8)
workspace/scripts/sft/setup_sft.shworkspace/scripts/sft/train_opsd.shworkspace/src/self_distill_hybrid/config/opsd_trainer.yamlworkspace/src/self_distill_hybrid/main_opsd.pyworkspace/src/self_distill_hybrid/opsd_trainer.pyworkspace/src/self_distill_hybrid/opsd_worker.pyworkspace/src/self_distill_hybrid/sd_verifier.pyworkspace/src/self_distill_hybrid/sd_worker.py
💤 Files with no reviewable changes (1)
- workspace/src/self_distill_hybrid/sd_worker.py
| # DP world size: total GPUs / (TP * Ulysses-SP). Used as the divisor | ||
| # when padding batches before the nd-compute dispatcher slices along | ||
| # the actor mesh's DP axis. Using total GPUs (as we did previously) | ||
| # over-pads whenever TP*SP > 1. | ||
| total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes | ||
| tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1) | ||
| sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1) | ||
| denom = max(1, int(tp) * int(sp)) | ||
| self.dp_world = max(1, total_gpus // denom) | ||
| py_logger.info( | ||
| "DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)", | ||
| self.dp_world, total_gpus, tp, sp, | ||
| ) |
There was a problem hiding this comment.
Derive dp_world from the actor worker group.
Line 101 derives the padding divisor from trainer-wide GPU config, but this trainer now explicitly allocates a role-specific actor_rollout_resource_pool. If that pool is smaller than the full trainer allocation, _opsd_update() will pad to the wrong divisor and the DP sharding path can mispartition or assert. Compute dp_world after self.actor_rollout_wg is created, using its actual world size instead.
Suggested fix
- total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes
tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1)
sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1)
- denom = max(1, int(tp) * int(sp))
- self.dp_world = max(1, total_gpus // denom)
- py_logger.info(
- "DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)",
- self.dp_world, total_gpus, tp, sp,
- )
+ self._dp_denom = max(1, int(tp) * int(sp))
+ self.dp_world = None
@@
self.actor_rollout_wg = all_wg[str(Role.ActorRolloutRef)]
self.actor_rollout_wg.init_model()
+ self.dp_world = max(1, self.actor_rollout_wg.world_size // self._dp_denom)
+ py_logger.info(
+ "DP world = %d (actor_world_size=%d, TP=%d, ulysses_sp=%d)",
+ self.dp_world,
+ self.actor_rollout_wg.world_size,
+ tp,
+ sp,
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 97 - 109, The
dp_world computation in _opsd_update currently uses trainer-wide GPU counts
(total_gpus) which can be larger than the role-specific actor pool; after
actor_rollout_wg is created, recompute dp_world from that group's actual world
size (e.g., use actor_rollout_wg.size or actor_rollout_wg.world_size) and divide
by (tensor_model_parallel_size * ulysses_sequence_parallel_size) instead of
using self.config.trainer.n_gpus_per_node * nnodes; update the code in
_opsd_update to set self.dp_world based on the actor_rollout_wg world size and
keep the same denom logic (tp and sp) so padding/divisor uses the real actor
worker group size.
| logger.info( | ||
| "OPSD batch[0] sample preview (max_length=%d):\n" | ||
| " teacher_prompt (sd_prompt) messages: %s\n" | ||
| " student_prompt (sft_prompt) messages: %s\n" | ||
| " response (%d chars): %s", | ||
| max_length, t_msgs, s_msgs, len(response_text), response_preview, |
There was a problem hiding this comment.
Avoid logging raw prompt/response content at INFO level.
This emits decoded prompts and response text into routine training logs, which is a privacy/compliance risk and can create noisy, high-volume logs. Gate this behind DEBUG and truncate/sanitize prompt payloads too.
Suggested hardening
def _log_first_opsd_pair(
t_prompt: str, s_prompt: str, response_text: str, max_length: int
) -> None:
@@
- response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "")
- logger.info(
+ t_preview = str(t_msgs)
+ s_preview = str(s_msgs)
+ t_preview = t_preview[:500] + ("…" if len(t_preview) > 500 else "")
+ s_preview = s_preview[:500] + ("…" if len(s_preview) > 500 else "")
+ response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "")
+ logger.debug(
"OPSD batch[0] sample preview (max_length=%d):\n"
" teacher_prompt (sd_prompt) messages: %s\n"
" student_prompt (sft_prompt) messages: %s\n"
" response (%d chars): %s",
- max_length, t_msgs, s_msgs, len(response_text), response_preview,
+ max_length, t_preview, s_preview, len(response_text), response_preview,
)- log_first_pair = logger.isEnabledFor(logging.INFO)
+ log_first_pair = logger.isEnabledFor(logging.DEBUG)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 335 - 340, The
current logger.info call in sd_verifier.py is emitting raw prompt and response
content (logger.info(..., max_length, t_msgs, s_msgs, len(response_text),
response_preview)), which is a privacy/compliance risk and too noisy; change
this to logger.debug and sanitize/truncate the payloads before logging (e.g.,
replace or mask sensitive tokens and limit t_msgs, s_msgs, and response_preview
to a fixed small length like 200 chars or a summarised count), or alternatively
log only metadata (max_length and response length) at INFO and the
sanitized/truncated content at DEBUG; update the call site that references
logger.info, max_length, t_msgs, s_msgs, response_text, and response_preview
accordingly.
| for idx, (t_prompt, s_prompt, response_text) in enumerate( | ||
| zip(teacher_prompts, student_prompts, responses) | ||
| ): |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the target file and inspect around the referenced lines
ls -la workspace/src/self_distill_hybrid || true
sed -n '330,430p' workspace/src/self_distill_hybrid/sd_verifier.py
# Find all occurrences of the zip call (to see if strict=True or other guards exist elsewhere)
rg -n "zip\(teacher_prompts, student_prompts, responses\)" workspace/src/self_distill_hybrid/sd_verifier.py
# Look for any function that builds/assigns teacher_prompts/student_prompts/responses
rg -n "teacher_prompts|student_prompts|responses" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 5270
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all call sites of build_opsd_batch
rg -n "build_opsd_batch\(" -S . | head -n 200
# Inspect sd_verifier.py around the function definition (broader context)
sed -n '320,430p' workspace/src/self_distill_hybrid/sd_verifier.py
# Inspect potential surrounding code that prepares these lists in sd_verifier.py
# (look for code that defines/assigns teacher_prompts/student_prompts/responses near where build_opsd_batch is called)
rg -n "teacher_prompts\s*=|student_prompts\s*=|responses\s*=" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200
# For each build_opsd_batch call site, print a small window around it
# (rg output is bounded; we then sed based on file and approximate line)
rg -n "build_opsd_batch\(" -S . | cut -d: -f1,2 | while read -r f l; do
echo "---- $f:$l ----"
sed -n "$((l-20)),$((l+60))p" "$f"
doneRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 4155
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '560,720p' workspace/src/self_distill_hybrid/opsd_trainer.py
sed -n '760,880p' workspace/src/self_distill_hybrid/opsd_trainer.py
# Also locate the exact call to build_opsd_batch and print a tight window
rg -n "opsd_batch\s*=\s*build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py
rg -n "build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py | head -n 50Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 13185
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where non_tensor_batch keys "sd_prompt" and "sft_prompt" are set/produced
rg -n "\"sd_prompt\"|\"sft_prompt\"" workspace/src/self_distill_hybrid | head -n 200
# Inspect sd_dataset.py for building those fields
sed -n '1,260p' workspace/src/self_distill_hybrid/sd_dataset.py
sed -n '260,520p' workspace/src/self_distill_hybrid/sd_dataset.py
# Check for any DataProto/collate logic that enforces list length consistency
rg -n "non_tensor_batch" workspace/src/self_distill_hybrid | head -n 200Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 11309
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look for Python version spec in common project files
ls -la
find . -maxdepth 3 -type f \( -iname "pyproject.toml" -o -iname "setup.cfg" -o -iname "setup.py" -o -iname "requirements*.txt" -o -iname "Pipfile" -o -iname "environment*.yml" -o -iname ".python-version" \) | sed 's|^\./||'
# Print relevant excerpts if present
for f in pyproject.toml setup.cfg setup.py .python-version requirements.txt requirements-dev.txt environment.yml environment.yaml Pipfile; do
if [ -f "$f" ]; then
echo "---- $f ----"
sed -n '1,200p' "$f"
fi
done
# If pyproject exists, extract the python version constraint lines
if [ -f "pyproject.toml" ]; then
rg -n "python\s*=" pyproject.toml || true
rg -n "requires-python" pyproject.toml || true
rg -n "Programming Language :: Python" -S pyproject.toml setup.cfg setup.py || true
fiRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 647
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,200p' workspace/requirements.txt
sed -n '1,200p' README.md
# Any explicit python version files
find . -maxdepth 2 -type f -iname "*python*" -o -iname ".python-version" -o -iname "*runtime*" | sed 's|^\./||'
if [ -f .python-version ]; then cat .python-version; fiRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 8163
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "zip\([^)]*,\s*strict\s*=\s*True" -S workspace/src/self_distill_hybrid || true
rg -n "zip\(" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 50Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 268
Do not let zip silently drop mismatched inputs.
build_opsd_batch() iterates with zip(teacher_prompts, student_prompts, responses); if any upstream contract breaks and the lists differ in length, zip truncates silently and can misalign prompts and responses. Add a length guard (and/or strict=True; repo targets Python 3.10+).
Suggested fix
+ if not (len(teacher_prompts) == len(student_prompts) == len(responses)):
+ raise ValueError(
+ "build_opsd_batch expects equal lengths for "
+ "teacher_prompts, student_prompts, and responses"
+ )
for idx, (t_prompt, s_prompt, response_text) in enumerate(
- zip(teacher_prompts, student_prompts, responses)
+ zip(teacher_prompts, student_prompts, responses, strict=True)
):🧰 Tools
🪛 Ruff (0.15.14)
[warning] 385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 384 - 386, In
build_opsd_batch(), the loop uses zip(teacher_prompts, student_prompts,
responses) which silently truncates mismatched lengths; add an explicit length
check comparing len(teacher_prompts), len(student_prompts), and len(responses)
(or use zip(..., strict=True) since we target Python 3.10+) and raise a clear
ValueError including the three lengths if they differ, so the iteration cannot
proceed with misaligned inputs; update the for-loop (identified by the tuple
unpacking for idx, (t_prompt, s_prompt, response_text) in enumerate(...)) to
rely on the guarded lengths or strict zip.
Two small follow-ups uncovered while watching the LVA1 quick-test run of the verl 0.7.0.7 rebase. sd_verifier._log_first_opsd_pair: emit at WARNING and drop the logger.isEnabledFor(logging.INFO) gate. Hydra raises the __main__ logger to INFO but leaves the root logger at the WARNING default, so child loggers like self_distill_hybrid.sd_verifier silently swallow INFO calls when running inside the Ray actor. Same logger's WARNING messages (e.g. the over-length skip report) DO surface, which is what tipped me off — the preview was supposed to fire at idx==0 of every build_opsd_batch call but never showed up in the cluster log. Use logger.warning and drop the gate so it always emits exactly once per training step. process_eval_data.process_dapo: stop producing val_dapo.parquet. The downstream training pipeline only validates on AIME24, AIME25, and MATH-500 (the auto-detect in train_opsd.sh never picked up val_dapo). The previous 5% train/val split was therefore burning 870 examples that no consumer ever read. Return the full DAPO set as the training split; skip the val output and the no-longer- reachable summary lines. Module docstring, section banner, and process_dapo docstring updated to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@workspace/src/data/process_eval_data.py`:
- Around line 40-45: The README still describes a 95/5 DAPO preprocessing split
and production of val_dapo.parquet but the code in process_dapo now keeps all
DAPO examples in a single training split; update the README.md documentation to
reflect this new behavior by replacing the 95/5 split description and any
references to val_dapo.parquet with text that states DAPO is processed as a
single training split (no validation output produced) and adjust any example
commands/outputs that mention val_dapo.parquet accordingly; verify and update
the README sections that previously referenced the split and the val_dapo output
so they match the process_dapo implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b64964a0-9a48-43a1-a164-8692647becc7
📒 Files selected for processing (2)
workspace/src/data/process_eval_data.pyworkspace/src/self_distill_hybrid/sd_verifier.py
🚧 Files skipped from review as they are similar to previous changes (1)
- workspace/src/self_distill_hybrid/sd_verifier.py
| def process_dapo(data_path): | ||
| """Process DAPO-Math-17k-dedup as a single training split. | ||
|
|
||
| The training pipeline validates on AIME24/AIME25/MATH-500 only, so the | ||
| previous train/val split was producing a val_dapo.parquet that no | ||
| downstream code consumed. Keep all examples in training instead. |
There was a problem hiding this comment.
Update README preprocessing contract to match the new single-split behavior.
This change keeps all DAPO records in training, but README.md still documents a 95/5 DAPO split and val_dapo.parquet output (README.md Line 89 and Line 99-102). Please update docs in this PR to prevent operator confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/src/data/process_eval_data.py` around lines 40 - 45, The README
still describes a 95/5 DAPO preprocessing split and production of
val_dapo.parquet but the code in process_dapo now keeps all DAPO examples in a
single training split; update the README.md documentation to reflect this new
behavior by replacing the 95/5 split description and any references to
val_dapo.parquet with text that states DAPO is processed as a single training
split (no validation output produced) and adjust any example commands/outputs
that mention val_dapo.parquet accordingly; verify and update the README sections
that previously referenced the split and the val_dapo output so they match the
process_dapo implementation.
The previous update_teacher wrapped the param copy in nested
FSDP.summon_full_params(...), which materialises the FULL unsharded
parameters of BOTH actor and ref on every rank for the duration of the
copy. For Qwen3-14B that's ~30 GiB x 2 modules = ~60 GiB on top of the
sharded params already on-GPU. When teacher_update_freq > 0 fires right
after a training step has loaded the actor for fwd/bwd, this puts the
H100 80 GB rank over the line and the rank OOMs with:
ray::WorkerDict.actor_rollout_ref_update_teacher()
torch.OutOfMemoryError: Tried to allocate 5.80 GiB.
GPU 0 has a total capacity of 79.19 GiB of which 2.28 GiB is free.
Observed in tu1 (teacher_update_freq=1) right after val_before_train +
first opsd_update on LVA1 (Flyte fcbed452f8d7640709d2 -> retry
f9941301d258e4619921). tu10/20/50/100 would have hit the same wall on
their first teacher copy.
The actor and ref modules are wrapped with the SAME FSDP policy on the
SAME DP mesh during init in verl's ActorRolloutRefWorker, so their local
parameter shards correspond element-wise. Iterating .parameters() on
both in lock-step and copying p_student.data into p_teacher.data does
the right thing without any unsharded materialisation. Add a shape-equal
assert to make any future wrap-policy drift loud.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qwen3-14B + Ulysses-SP=4 on H100 80 GB peaks at ~72-74 GiB during the
reverse-KL / JSD path inside _opsd_training_step, leaving little to no
slack on top of sglang's ~2.87 GiB rollout footprint. With the default
best-fit allocator we hit:
torch.kl_div(input, target, ...)
torch.OutOfMemoryError: Tried to allocate 150.00 MiB.
GPU 0 has total 79.19 GiB of which 130.69 MiB is free.
PyTorch allocated 72.16 GiB, 120 MiB reserved-unallocated.
(observed on the LVA1 quick test fcbed452f8d7640709d2 step 3 and the
tu20 retry fa063558d3d234157b4c step 1). 120 MiB reserved-unallocated
+ < 150 MiB free + 150 MiB requested is the textbook fragmentation
signature; expandable_segments coalesces those holes and almost always
gives a few GB back. PyTorch's own OOM message points at this setting
as the first thing to try.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
workspace/src/self_distill_hybrid/opsd_worker.py (1)
295-301: ⚡ Quick winValidate response-token counts per sample, not only in aggregate.
This assert only compares flattened totals. A malformed batch can still pass with equal totals while pairing the wrong teacher/student logits across examples. Since the contract is per-pair alignment, check the shifted
loss_maskcounts row-by-row before flattening.Proposed refactor
+ t_resp_counts = t_loss_mask[:, 1:].sum(dim=1) + s_resp_counts = s_loss_mask[:, 1:].sum(dim=1) + if not torch.equal(t_resp_counts, s_resp_counts): + raise RuntimeError( + "teacher/student response-token counts differ per sample: " + f"teacher={t_resp_counts.tolist()} student={s_resp_counts.tolist()}" + ) assert teacher_logits.shape[0] == student_logits.shape[0], ( f"teacher/student response-token count mismatch: " f"{teacher_logits.shape[0]} vs {student_logits.shape[0]}. "🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 295 - 301, The current assert only checks flattened totals (teacher_logits.shape[0] vs student_logits.shape[0]) which can hide per-example misalignment; compute per-sample response-token counts from the shifted loss_mask (e.g., resp_counts = (loss_mask[:, 1:] != 0).sum(dim=1)) and use those counts to (a) assert resp_counts.sum() equals teacher_logits.shape[0] and student_logits.shape[0], and (b) verify that splitting/offsetting teacher_logits and student_logits into per-sample chunks using resp_counts yields matching lengths for each pair; update the code around teacher_logits/student_logits handling in opsd_worker.py (and ensure this invariant complements upstream filter_overlong_prompts and build_opsd_batch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 94-105: The loop in update_teacher uses
zip(self.actor_module_fsdp.parameters(), self.ref_module_fsdp.parameters())
which silently truncates if the two parameter iterators differ; change the loop
to use zip(..., strict=True) so it raises immediately on length mismatch,
preserving the existing per-parameter shape check and RuntimeError behavior
(ensure the runtime environment targets Python 3.10+ as noted in README.md).
---
Nitpick comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 295-301: The current assert only checks flattened totals
(teacher_logits.shape[0] vs student_logits.shape[0]) which can hide per-example
misalignment; compute per-sample response-token counts from the shifted
loss_mask (e.g., resp_counts = (loss_mask[:, 1:] != 0).sum(dim=1)) and use those
counts to (a) assert resp_counts.sum() equals teacher_logits.shape[0] and
student_logits.shape[0], and (b) verify that splitting/offsetting teacher_logits
and student_logits into per-sample chunks using resp_counts yields matching
lengths for each pair; update the code around teacher_logits/student_logits
handling in opsd_worker.py (and ensure this invariant complements upstream
filter_overlong_prompts and build_opsd_batch).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 29287c03-b724-4118-9970-1cb73a5f78aa
📒 Files selected for processing (1)
workspace/src/self_distill_hybrid/opsd_worker.py
| for p_student, p_teacher in zip( | ||
| self.actor_module_fsdp.parameters(), | ||
| self.ref_module_fsdp.parameters(), | ||
| ): | ||
| if p_student.shape != p_teacher.shape: | ||
| raise RuntimeError( | ||
| f"update_teacher shard mismatch at param {n_params}: " | ||
| f"actor shard {tuple(p_student.shape)} vs ref shard " | ||
| f"{tuple(p_teacher.shape)}. Both FSDP modules must " | ||
| "be wrapped with the same policy on the same mesh." | ||
| ) | ||
| p_teacher.data.copy_(p_student.data) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
import sys
print(f"sandbox_python={sys.version.split()[0]}")
PY
fd -HI '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' | while read -r f; do
printf '\n== %s ==\n' "$f"
rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:' "$f" || true
doneRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 100
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate likely Python version declaration files
files=$(fd -t f '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' . || true)
echo "Found python-config files:"
echo "$files" | sed 's|^| - |'
# Extract python version constraints from those files (and a few common extras)
echo
echo "Python version constraints (if present):"
for f in $files; do
echo
echo "== $f =="
sed -n '1,200p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:|^FROM python:|target-version|language-version|python_version' || true
sed -n '1,200p' "$f" | rg -n '\[tool\.poetry\.dependencies\]|\[project\]|\[tool\.poetry\]' || true
done
# Confirm context for the teacher sync code path
echo
echo "Context around opsd_worker.py teacher update parameters copy loop:"
fd -t f 'opsd_worker\.py$' workspace/src/self_distill_hybrid/opsd_worker.py >/dev/null 2>&1 && true
python_file="workspace/src/self_distill_hybrid/opsd_worker.py"
if [ -f "$python_file" ]; then
rg -n 'update_teacher|teacher shard mismatch|actor_module_fsdp|ref_module_fsdp|p_teacher\.data\.copy_' "$python_file" -n -S
echo
# show surrounding lines (avoid dumping too much)
sed -n '60,140p' "$python_file" | nl -ba | rg -n 'for p_student, p_teacher|teacher shard mismatch|copy_\(p_student|parameters\(' || true
else
echo "Missing: $python_file"
fiRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 2746
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for common Python version constraint files
candidates=$(
(fd -t f 'pyproject\.toml$' . || true)
(fd -t f 'setup\.cfg$' . || true)
(fd -t f 'setup\.py$' . || true)
(fd -t f 'tox\.ini$' . || true)
(fd -t f '\.python-version$' . || true)
(fd -t f '^Dockerfile' . || true)
)
echo "Constraint file candidates:"
if [ -z "${candidates// }" ]; then
echo " (none found)"
else
echo "$candidates" | sed 's/^/ - /'
fi
# Extract Python version constraints from any found candidates
if [ -n "${candidates// }" ]; then
echo
echo "Extracted Python version constraints:"
echo "$candidates" | while read -r f; do
[ -f "$f" ] || continue
echo
echo "== $f =="
sed -n '1,220p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM python:|target-version|language-version|python_version' || true
done
fi
# Also just print the signature of zip for sanity (sandbox runtime)
python - <<'PY'
import sys, inspect
print("sandbox python:", sys.version)
print("zip signature:", inspect.signature(zip))
PYRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 824
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# List likely CI/workflow files
echo "Root files (top-level):"
ls -la | head -n 60
echo
echo "Searching for Python version declarations in repo files..."
# Common CI markers
rg -n --hidden --glob '!.git/*' \
'actions/setup-python|python-version:|setup-python|PYTHON_VERSION|requires-python|python_requires|FROM python:|language: python|target-version' \
. || true
echo
echo "Searching for any usage of zip(strict=...) in the repo..."
rg -n --hidden --glob '!.git/*' 'zip\([^)]*strict\s*=' . || true
echo
echo "Searching for docs mentioning supported Python versions..."
rg -n --hidden --glob '!.git/*' \
'Python (3\.[0-9]+|>=|minimum|supported|compatibility)|python 3\.[0-9]+' \
README* docs* .github* . || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 1005
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python_file="workspace/src/self_distill_hybrid/opsd_worker.py"
sed -n '80,130p' "$python_file" | nl -ba | sed -n '1,120p'Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 121
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python_file="workspace/src/self_distill_hybrid/opsd_worker.py"
echo "=== Exact zip loop block ==="
# Print a tighter window around the zip() loop
rg -n "for p_student, p_teacher in zip\(" "$python_file"
start=$(rg -n "for p_student, p_teacher in zip\(" "$python_file" | head -n1 | cut -d: -f1)
echo "Start line: $start"
lo=$((start-5))
hi=$((start+30))
sed -n "${lo},${hi}p" "$python_file" | cat -n
echo
echo "=== Context lines 70-120 ==="
sed -n '70,120p' "$python_file" | cat -nRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 4927
Fail fast when FSDP parameter iterators diverge in update_teacher. zip() truncates to the shorter iterator; if actor_module_fsdp.parameters() and ref_module_fsdp.parameters() produce different lengths, trailing teacher shards won’t be copied (yet the update still logs success). Use zip(..., strict=True) (Python 3.10+ per README.md).
Proposed fix
for p_student, p_teacher in zip(
self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters(),
+ strict=True,
):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for p_student, p_teacher in zip( | |
| self.actor_module_fsdp.parameters(), | |
| self.ref_module_fsdp.parameters(), | |
| ): | |
| if p_student.shape != p_teacher.shape: | |
| raise RuntimeError( | |
| f"update_teacher shard mismatch at param {n_params}: " | |
| f"actor shard {tuple(p_student.shape)} vs ref shard " | |
| f"{tuple(p_teacher.shape)}. Both FSDP modules must " | |
| "be wrapped with the same policy on the same mesh." | |
| ) | |
| p_teacher.data.copy_(p_student.data) | |
| for p_student, p_teacher in zip( | |
| self.actor_module_fsdp.parameters(), | |
| self.ref_module_fsdp.parameters(), | |
| strict=True, | |
| ): | |
| if p_student.shape != p_teacher.shape: | |
| raise RuntimeError( | |
| f"update_teacher shard mismatch at param {n_params}: " | |
| f"actor shard {tuple(p_student.shape)} vs ref shard " | |
| f"{tuple(p_teacher.shape)}. Both FSDP modules must " | |
| "be wrapped with the same policy on the same mesh." | |
| ) | |
| p_teacher.data.copy_(p_student.data) |
🧰 Tools
🪛 Ruff (0.15.14)
[warning] 94-97: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 94 - 105, The
loop in update_teacher uses zip(self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters()) which silently truncates if the two parameter
iterators differ; change the loop to use zip(..., strict=True) so it raises
immediately on length mismatch, preserving the existing per-parameter shape
check and RuntimeError behavior (ensure the runtime environment targets Python
3.10+ as noted in README.md).
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
workspace/scripts/sft/train_opsd.sh (1)
36-42: ⚡ Quick winDon't clobber caller-provided allocator tuning.
This overwrites any existing
PYTORCH_CUDA_ALLOC_CONF, so operators can't combineexpandable_segmentswith other allocator knobs they already rely on.Suggested change
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +if [[ -n "${PYTORCH_CUDA_ALLOC_CONF:-}" ]]; then + case ",${PYTORCH_CUDA_ALLOC_CONF}," in + *,expandable_segments:True,*|*,expandable_segments:true,*) ;; + *) export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF},expandable_segments:True" ;; + esac +else + export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True +fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workspace/scripts/sft/train_opsd.sh` around lines 36 - 42, The script currently overwrites PYTORCH_CUDA_ALLOC_CONF which destroys caller-provided tuning; update the export in train_opsd.sh to preserve existing settings by checking if PYTORCH_CUDA_ALLOC_CONF is set and, if so, append ",expandable_segments:True" (avoiding duplicates) otherwise set it to "expandable_segments:True", and ensure the code references the PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so other allocator knobs remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 58-63: The script always enables MLflow in the HuggingFace Trainer
by setting trainer.logger='["console","mlflow"]' which causes local runs to
create mlruns/; change the logic in workspace/scripts/sft/train_opsd.sh around
the trainer.logger assignment so that "mlflow" is only added when a tracking
backend is configured (e.g. MLFLOW_TRACKING_URI is non-empty or
FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that
detects those env vars and sets trainer.logger to '["console","mlflow"]' when
true and '["console"]' otherwise, referencing the trainer.logger assignment and
the sourced setup_mlflow_hf.sh for placement.
---
Nitpick comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 36-42: The script currently overwrites PYTORCH_CUDA_ALLOC_CONF
which destroys caller-provided tuning; update the export in train_opsd.sh to
preserve existing settings by checking if PYTORCH_CUDA_ALLOC_CONF is set and, if
so, append ",expandable_segments:True" (avoiding duplicates) otherwise set it to
"expandable_segments:True", and ensure the code references the
PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so
other allocator knobs remain intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 498686f6-65e2-4738-afca-1c04762d81b6
📒 Files selected for processing (1)
workspace/scripts/sft/train_opsd.sh
| # Auto-configure MLflow tracking when running on the Flyte/Kingkong cluster. | ||
| # Sets MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT from | ||
| # Flyte execution env vars so HuggingFace Trainer (and any other downstream | ||
| # code that respects MLFLOW_*) logs to go/mlflowui automatically. Becomes a | ||
| # no-op if FLYTE_INTERNAL_EXECUTION_PROJECT is unset (local invocations). | ||
| source "${SCRIPT_DIR}/setup_mlflow_hf.sh" |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In VERL 0.7.x, what happens when trainer.loggerincludes "mlflow" butMLFLOW_TRACKING_URIis unset? Does it no-op, create a localmlruns store, or fail initialization?
💡 Result:
In VERL 0.7.x, when trainer.logger includes "mlflow" but MLFLOW_TRACKING_URI is unset, the behavior is determined by the underlying MLflow Python library's default configuration rather than explicit handling within the VERL framework [1][2]. By default, when MLFLOW_TRACKING_URI is not set, MLflow initializes a local backend, typically creating a directory named mlruns in the current working directory to store tracking data [1]. VERL’s tracking.py module integrates MLflow by importing it and calling mlflow.log_metrics directly within its logging methods [2]. Because VERL does not override the default MLflow tracking URI, the system will fall back to this default local behavior [1][2]. It does not fail initialization due to the missing environment variable; instead, it proceeds to log locally [1][2]. If you wish to use a specific local path or a remote server, you must explicitly set the MLFLOW_TRACKING_URI environment variable before the trainer initializes, as documented in the VERL guidelines [1].
Citations:
- 1: https://verl.readthedocs.io/en/latest/advance/rollout_trace.html
- 2: https://verl.readthedocs.io/en/latest/_modules/verl/utils/tracking.html
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE"
exit 1
fi
echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'
echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'
# Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR
# Determine SCRIPT_DIR value? Just search for the file name in repo.
echo "---- find setup_mlflow_hf.sh ----"
fd -a "setup_mlflow_hf.sh" . || true
# Print that script if small enough; otherwise print relevant parts.
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then
echo "---- file: $SETUP (first 200 lines) ----"
wc -l "$SETUP"
nl -ba "$SETUP" | sed -n '1,220p'
fi
# Search for FLYTE_INTERNAL_EXECUTION_PROJECT usage in setup or train script
echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true
# Search for trainer.logger usage in train_opsd.sh
echo "---- rg trainer\.logger ----"
rg -n "trainer\.logger" "$FILE" || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 168
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE"
exit 1
fi
echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'
echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'
# Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR
echo "---- find setup_mlflow_hf.sh ----"
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ]; then
echo "Found: $SETUP"
wc -l "$SETUP"
echo "---- file: $SETUP (first 220 lines) ----"
nl -ba "$SETUP" | sed -n '1,220p'
else
echo "setup_mlflow_hf.sh not found"
fi
echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true
echo "---- rg trainer\.logger in train_opsd.sh ----"
rg -n "trainer\.logger" "$FILE" || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 168
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE"
exit 1
fi
echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'
echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'
echo "---- find setup_mlflow_hf.sh ----"
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then
echo "Found: $SETUP"
wc -l "$SETUP"
echo "---- file: $SETUP (first 220 lines) ----"
nl -ba "$SETUP" | sed -n '1,220p'
else
echo "setup_mlflow_hf.sh not found"
fi
echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true
echo "---- rg trainer\.logger in train_opsd.sh ----"
rg -n "trainer\.logger" "$FILE" || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 168
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="workspace/scripts/sft/train_opsd.sh"
echo "== file exists? =="
ls -la "$FILE" || true
echo "== print lines 45-90 of $FILE (with numbers) =="
python3 - <<'PY'
from pathlib import Path
path = Path("workspace/scripts/sft/train_opsd.sh")
lines = path.read_text().splitlines()
for i in range(44, 90): # 1-based -> slice index
if i < len(lines):
print(f"{i+1}\t{lines[i]}")
PY
echo "== print lines 260-320 of $FILE (with numbers) =="
python3 - <<'PY'
from pathlib import Path
path = Path("workspace/scripts/sft/train_opsd.sh")
lines = path.read_text().splitlines()
for i in range(259, 320):
if i < len(lines):
print(f"{i+1}\t{lines[i]}")
PY
echo "== find setup_mlflow_hf.sh =="
SETUP_PATHS="$(find . -type f -name 'setup_mlflow_hf.sh' 2>/dev/null || true)"
echo "$SETUP_PATHS" | head -n 20
SETUP_PATH="$(echo "$SETUP_PATHS" | head -n 1)"
if [ -n "${SETUP_PATH:-}" ]; then
echo "== setup file line count =="
python3 - <<PY
from pathlib import Path
p=Path("$SETUP_PATH")
print(sum(1 for _ in p.open('r',encoding='utf-8',errors='ignore')))
PY
echo "== print lines 1-220 of $SETUP_PATH (with numbers) =="
python3 - <<'PY'
from pathlib import Path
path = Path("PLACEHOLDER")
PY
fi
if [ -n "${SETUP_PATH:-}" ]; then
python3 - <<PY
from pathlib import Path
path = Path("$SETUP_PATH")
lines = path.read_text().splitlines()
for i in range(0, min(220, len(lines))):
print(f"{i+1}\t{lines[i]}")
PY
fi
echo "== search for FLYTE_INTERNAL_EXECUTION_PROJECT =="
if command -v rg >/dev/null 2>&1; then
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true
else
grep -RIn "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true
fi
echo "== search for trainer.logger =="
if command -v rg >/dev/null 2>&1; then
rg -n "trainer\.logger" "$FILE" || true
else
grep -n "trainer\.logger" "$FILE" || true
fiRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 4852
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== ls workspace/scripts/sft =="
ls -la workspace/scripts/sft || true
echo "== find setup_mlflow_hf.sh (case-insensitive) =="
find workspace/scripts/sft -maxdepth 1 -iname 'setup_mlflow_hf.sh' -print || true
echo "== find setup_mlflow_hf.sh anywhere (case-insensitive, limited to first 20) =="
find . -type f -iname 'setup_mlflow_hf.sh' 2>/dev/null | head -n 20 || true
# If the expected file exists, print it
if [ -f "workspace/scripts/sft/setup_mlflow_hf.sh" ]; then
echo "== setup_mlflow_hf.sh exists; line count =="
python3 - <<'PY'
from pathlib import Path
p=Path("workspace/scripts/sft/setup_mlflow_hf.sh")
print(sum(1 for _ in p.open("r",encoding="utf-8",errors="ignore")))
PY
echo "== setup_mlflow_hf.sh (first 200 lines with numbers) =="
python3 - <<'PY'
from pathlib import Path
path=Path("workspace/scripts/sft/setup_mlflow_hf.sh")
lines=path.read_text().splitlines()
for i in range(0, min(200, len(lines))):
print(f"{i+1}\t{lines[i]}")
PY
fi
echo "== rg for setup_mlflow_hf.sh references =="
rg -n "setup_mlflow_hf\.sh" . || true
echo "== rg for FLYTE_INTERNAL_EXECUTION_PROJECT across repo =="
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true
echo "== rg for mlflow logger handling in repo (best-effort) =="
rg -n "mlflow" workspace/scripts/sft . || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 1474
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== find any setup_mlflow* scripts =="
find . -type f \( -iname 'setup_mlflow*' -o -iname '*mlflow*setup*' \) | head -n 50
echo "== find mlflow setup scripts by name fragments =="
find . -type f \( -iname '*mlflow*.sh' -o -iname '*mlflow*setup*' \) | head -n 50
echo "== rg for MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT =="
rg -n "MLFLOW_TRACKING_URI|MLFLOW_EXPERIMENT_NAME|MLFLOW_RUN_CONTEXT" . || true
echo "== rg for go/mlflowui =="
rg -n "go/mlflowui" . || true
echo "== rg for mlruns directory creation =="
rg -n "mlruns" . || true
echo "== rg for trainer.logger usage =="
rg -n "trainer\.logger" . || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 777
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="workspace/src/self_distill_hybrid/opsd_trainer.py"
python3 - <<'PY'
from pathlib import Path
path=Path("workspace/src/self_distill_hybrid/opsd_trainer.py")
lines=path.read_text().splitlines()
start=540; end=650 # around 581
for i in range(start-1, min(end, len(lines))):
print(f"{i+1}\t{lines[i]}")
PY
echo "== rg for default_backend/self.config.trainer.logger usage =="
rg -n "default_backend=self\.config\.trainer\.logger|trainer\.logger|logger" workspace/src/self_distill_hybrid/opsd_trainer.py || true
echo "== rg for mlflow usage in self_distill_hybrid =="
rg -n "mlflow" workspace/src/self_distill_hybrid || trueRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 7474
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== locate verl/utils/tracking.py in repo =="
find . -type f -path '*verl*utils*tracking.py' -maxdepth 8 2>/dev/null || true
find . -type f -name 'tracking.py' | head -n 50
echo "== rg for 'class Tracking' =="
rg -n "class Tracking" . || true
echo "== rg for 'default_backend' =="
rg -n "default_backend" . || true
echo "== rg for 'mlflow' inside Tracking implementation paths (best-effort) =="
rg -n "mlflow" . -g'*.py' | head -n 50Repository: HJSang/CRISP_Reasoning_Compression
Length of output: 368
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import inspect, os, sys
try:
import verl.utils.tracking as tracking
except Exception as e:
print("Failed to import verl.utils.tracking:", e)
sys.exit(0)
print("Tracking module file:", tracking.__file__)
src = inspect.getsource(tracking)
print("---- tracking.py snippet around Tracking class ----")
# Print only relevant fragments to keep output small
for needle in ["class Tracking", "mlflow", "MLFLOW_TRACKING_URI", "default_backend", "tracking_uri", "set_tracking_uri"]:
idx = src.find(needle)
print(needle, "found" if idx!=-1 else "NOT found", "at", idx)
print("---- end ----")
PY
python3 - <<'PY'
import re, inspect
import verl.utils.tracking as tracking
src = inspect.getsource(tracking)
# Extract likely mlflow-related functions/logic
patterns = [
r"def .*mlflow.*?:.*?(?=\\n\\ndef|\\Z)",
r"class Tracking.*?:.*?(?=\\n\\n)",
]
print("---- quick grep-like lines containing mlflow/uri ----")
for line in src.splitlines():
if "mlflow" in line.lower() or "tracking_uri" in line.lower() or "MLFLOW_" in line:
print(line)
PYRepository: HJSang/CRISP_Reasoning_Compression
Length of output: 274
Gate the MLflow logger so local runs stay console-only.
workspace/scripts/sft/train_opsd.sh always sets trainer.logger='["console", "mlflow"]' (line 285). Even if setup_mlflow_hf.sh is a local no-op, VERL routes tracking to MLflow whenever "mlflow" is present; with MLFLOW_TRACKING_URI unset, MLflow falls back to its local backend (creating mlruns/), so local invocations won’t be console-only.
Suggested change
+TRAINER_LOGGER='["console"]'
+if [[ -n "${FLYTE_INTERNAL_EXECUTION_PROJECT:-}" ]]; then
+ TRAINER_LOGGER='["console", "mlflow"]'
+fi
+
python3 -m self_distill_hybrid.main_opsd \
@@
- trainer.logger='["console", "mlflow"]' \
+ trainer.logger="${TRAINER_LOGGER}" \🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 63-63: Not following: ./setup_mlflow_hf.sh was not specified as input (see shellcheck -x).
(SC1091)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@workspace/scripts/sft/train_opsd.sh` around lines 58 - 63, The script always
enables MLflow in the HuggingFace Trainer by setting
trainer.logger='["console","mlflow"]' which causes local runs to create mlruns/;
change the logic in workspace/scripts/sft/train_opsd.sh around the
trainer.logger assignment so that "mlflow" is only added when a tracking backend
is configured (e.g. MLFLOW_TRACKING_URI is non-empty or
FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that
detects those env vars and sets trainer.logger to '["console","mlflow"]' when
true and '["console"]' otherwise, referencing the trainer.logger assignment and
the sourced setup_mlflow_hf.sh for placement.
Commit 65f827f set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to relieve the reverse-KL / JSD path fragmentation pressure inside _opsd_training_step. But sglang's torch_memory_saver hard-asserts against this allocator in _ensure_initialized: RuntimeError: TorchMemorySaver is disabled for the current process because expandable_segments is not supported yet. triggered from sglang.srt.model_executor.model_runner.load_model -> torch_memory_saver/entrypoint.py:199. This crashes the SGLang rollout during model loading, which kills the Master replica and fails the PyTorchJob with an opaque CalledProcessError. Observed in tu50/tu100 retries (Flyte f486fd4845df14f5b930 / f782dcc0ede2e4c1babf), which were the only resubmissions whose mldev tarball was uploaded after 65f827f. tu1/tu10/tu20 retries uploaded before 65f827f and proceeded normally. Leave a comment in place so future-me doesn't re-add this. The underlying reverse-KL OOM will be addressed in opsd_worker.py via smaller chunk_size in _compute_*_loss_liger. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both tu10 (f2e8314b96b034161b5b) and tu20 (f6c97d63f75ee4383837) OOMed
in actor_rollout_ref_update_opsd with peaks ~72 GiB on H100 80 GB. The
hot allocations are the per-chunk fp32 intermediates inside the JSD /
reverse-KL paths in opsd_worker._compute_*_loss[_liger]:
- log_softmax, exp, mixture, kl_t, kl_s — up to ~6 buffers of
(chunk_size, V=151,936) fp32 each
- at chunk_size=256 that's 6 * 256 * 152K * 4 = ~935 MiB per
transient buffer; the autograd graph retains the student-side
versions until backward, compounding pressure
Cut all chunk_size defaults to 1/4:
- standard JSD / reverse-KL: 512 -> 128
- liger JSD / reverse-KL: 256 -> 64
- standard entropy: 512 -> 128
- liger entropy: 256 -> 64
This shrinks each transient ~234 MiB (chunk_size=128, fp32) or ~117 MiB
(chunk_size=64), giving ~3-4 GiB peak-memory relief in the JSD path.
Loss math is unchanged — chunking is purely an algorithmic device.
Per the goal's OOM directive, also halve train_batch_size 8 -> 4 in
qwen3-14b-opsd-length-prune-{tu10,tu20}.json. With dp_world=2 the
padding path stays clean (4 % 2 == 0). Effective compute per step is
the same (micro_batch_size=1 means each step still backprops one
sample at a time); only the gradient-accumulation count changes from
4 to 2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Brings the OPSD trainer/worker, verifier, dataset config, and SFT launch
scripts up to date with the verl 0.7.0.7 hybrid-engine API that ships in
the cluster image (
mldev-verl-sglang-py312-cu13-image:1.0.206, verl0.7.0.7+deec5d0.cu130). Drops the deadverl/overlay path in thesetup script, drops the SFT-style
_compute_val_lossmachinery, andfixes the rollout↔trainer weight-sync gap that was silently keeping
sglang generations on the initial checkpoint.
What changed
Trainer / worker (
workspace/src/self_distill_hybrid/)CheckpointEngineManagerinOPSDTrainer.init_workers. Pushtrainer weights to sglang replicas at
fit()start, sleep replicasafter every gen, re-sync after each
update_opsd. Without this therollout was pinned to init weights for the whole run.
OPSDWorkerdirectly onAsyncActorRolloutRefWorker; deletethe now-orphan
SelfDistillWorkerbase class.update_teacher: switch toDispatch.ONE_TO_ALL(was sending adummy DataProto over an nd-compute dispatcher).
_opsd_update: pad train batches to actual DP world(
total_gpus / (TP * Ulysses-SP)), not total GPUs._compute_val_loss/_build_val_dataand thesd_val_filesplumbing. Generation-based
_validateoverdata.val_files+ rewardfn is the sole val signal.
_validate: drop the spurioussleep_replicas()that left rolloutasleep when the function returned.
AgentLoopManager.create: passactor_rollout_resource_pooltomatch verl PPO's call site.
main_opsd: constructval_reward_fnby directly instantiatingthe legacy
verl.workers.reward_manager.naive.NaiveRewardManager.verl 0.7.x'sload_reward_managerreturns the experimentalasync-only manager which isn't
__call__-able — that crashed_validateat the first val step.Verifier (
sd_verifier.py)_tokenize_sequencerefuses any pair whose teacher OR studentsequence would exceed
max_lengthinstead of silently truncating.One-sided truncation misaligned teacher/student response logits
per-sample for the whole batch tail. Defensive assert added in
_opsd_training_step.max_lengthso misconfigurations are loud._log_first_opsd_pairfor a readable INFO-level preview ofthe first
(teacher, student, response)triple per batch.Yaml (
config/opsd_trainer.yaml)custom_reward_functionblock; reward fn livesunder the inherited
reward.custom_reward_function.sd_val_files/opsd.val_max_samples(val-loss machinery gone).Scripts (
workspace/scripts/sft/)setup_sft.sh: stop uninstalling and re-installing verl from a localvendored tree; the image ships matching verl. Just
pip installthe two extras (
math-verify,tensordict).train_opsd.sh: dropSD_VAL_PROMPTS_PATH, theVERL_ROOTPYTHONPATHprefix, and the broken top-levelcustom_reward_function.pathoverride (it never reachedload_reward_manager). Wire the dual-path math_verify scorer viareward.custom_reward_function.path+.name. DefaultREWARD_FN_PATHtoworkspace/src/rewards/dual_path_math_verify.py.Add
TOTAL_TRAINING_STEPShard cap. Tighten with-eo pipefail.Test plan
python3 -m ast).from verl.*import against verl 0.7.0.7module layout — all symbols resolve.
build_opsd_batchwrites 8 keys;_opsd_training_stepreadsthe same 8.
_validateno longer leaves replicas asleep on return(matches verl PPO
_validatepattern).f5b55303acc5540a5b75(tu0) — runs to firstval step without the
'NaiveRewardManager' object is not callablecrash.fcbed452f8d7640709d2(tu100) — same plusupdate_teacherfires twice at steps 100 and 200.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Chores
Style/Diagnostics