Skip to content

OPSD: rebase self_distill_hybrid on verl 0.7.0.7 API#11

Open
HJSang wants to merge 7 commits into
mainfrom
hejian/Test_M
Open

OPSD: rebase self_distill_hybrid on verl 0.7.0.7 API#11
HJSang wants to merge 7 commits into
mainfrom
hejian/Test_M

Conversation

@HJSang
Copy link
Copy Markdown
Owner

@HJSang HJSang commented May 30, 2026

Summary

Brings the OPSD trainer/worker, verifier, dataset config, and SFT launch
scripts up to date with the verl 0.7.0.7 hybrid-engine API that ships in
the cluster image (mldev-verl-sglang-py312-cu13-image:1.0.206, verl
0.7.0.7+deec5d0.cu130). Drops the dead verl/ overlay path in the
setup script, drops the SFT-style _compute_val_loss machinery, and
fixes the rollout↔trainer weight-sync gap that was silently keeping
sglang generations on the initial checkpoint.

What changed

Trainer / worker (workspace/src/self_distill_hybrid/)

  • Wire CheckpointEngineManager in OPSDTrainer.init_workers. Push
    trainer weights to sglang replicas at fit() start, sleep replicas
    after every gen, re-sync after each update_opsd. Without this the
    rollout was pinned to init weights for the whole run.
  • Rebase OPSDWorker directly on AsyncActorRolloutRefWorker; delete
    the now-orphan SelfDistillWorker base class.
  • update_teacher: switch to Dispatch.ONE_TO_ALL (was sending a
    dummy DataProto over an nd-compute dispatcher).
  • _opsd_update: pad train batches to actual DP world
    (total_gpus / (TP * Ulysses-SP)), not total GPUs.
  • Drop _compute_val_loss / _build_val_data and the sd_val_files
    plumbing. Generation-based _validate over data.val_files + reward
    fn is the sole val signal.
  • _validate: drop the spurious sleep_replicas() that left rollout
    asleep when the function returned.
  • AgentLoopManager.create: pass actor_rollout_resource_pool to
    match verl PPO's call site.
  • main_opsd: construct val_reward_fn by directly instantiating
    the legacy verl.workers.reward_manager.naive.NaiveRewardManager.
    verl 0.7.x's load_reward_manager returns the experimental
    async-only manager which isn't __call__-able — that crashed
    _validate at the first val step.

Verifier (sd_verifier.py)

  • _tokenize_sequence refuses any pair whose teacher OR student
    sequence would exceed max_length instead of silently truncating.
    One-sided truncation misaligned teacher/student response logits
    per-sample for the whole batch tail. Defensive assert added in
    _opsd_training_step.
  • Skip-counter warning now surfaces the over-length skip ratio +
    max_length so misconfigurations are loud.
  • Add _log_first_opsd_pair for a readable INFO-level preview of
    the first (teacher, student, response) triple per batch.

Yaml (config/opsd_trainer.yaml)

  • Drop dead top-level custom_reward_function block; reward fn lives
    under the inherited reward.custom_reward_function.
  • Drop sd_val_files / opsd.val_max_samples (val-loss machinery gone).

Scripts (workspace/scripts/sft/)

  • setup_sft.sh: stop uninstalling and re-installing verl from a local
    vendored tree; the image ships matching verl. Just pip install
    the two extras (math-verify, tensordict).
  • train_opsd.sh: drop SD_VAL_PROMPTS_PATH, the VERL_ROOT
    PYTHONPATH prefix, and the broken top-level
    custom_reward_function.path override (it never reached
    load_reward_manager). Wire the dual-path math_verify scorer via
    reward.custom_reward_function.path + .name. Default
    REWARD_FN_PATH to workspace/src/rewards/dual_path_math_verify.py.
    Add TOTAL_TRAINING_STEPS hard cap. Tighten with -eo pipefail.

Test plan

  • All seven source files + yaml parse cleanly (python3 -m ast).
  • Cross-checked every from verl.* import against verl 0.7.0.7
    module layout — all symbols resolve.
  • Verified DataProto tensor-key contract producer↔consumer:
    build_opsd_batch writes 8 keys; _opsd_training_step reads
    the same 8.
  • Confirmed _validate no longer leaves replicas asleep on return
    (matches verl PPO _validate pattern).
  • LVA1 submission f5b55303acc5540a5b75 (tu0) — runs to first
    val step without the 'NaiveRewardManager' object is not callable crash.
  • LVA1 submission fcbed452f8d7640709d2 (tu100) — same plus
    update_teacher fires twice at steps 100 and 200.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Generation-based validation with CLI overrides for custom reward function, training-steps cap, and pre-validation control.
  • Bug Fixes

    • Reject over-length samples to avoid token misalignment; clearer skipped/total drop messages with guidance.
  • Refactor

    • Async hybrid rollouts with checkpoint-driven weight sync; trainer/worker interfaces and teacher-update flow revised.
  • Chores

    • Scripts enforce stricter error handling; installer uses image-provided stack and only adds missing extras; DAPO preprocessing drops validation split.
  • Style/Diagnostics

    • Improved logging and broader package verification output (prints version and expanded package list).

Review Change Stack

Bring the OPSD trainer/worker, verifier, dataset config, and the SFT
launch scripts up to date with the verl 0.7.0.7 hybrid-engine surface
that ships in the cluster image.

Trainer / worker (workspace/src/self_distill_hybrid)
- Wire CheckpointEngineManager: construct it in OPSDTrainer.init_workers,
  push trainer weights to the sglang replicas at fit() start, sleep
  replicas after every gen, and re-sync after each update_opsd so the
  on-policy invariant actually holds. Without this the rollout was
  pinned to the initial checkpoint for the whole run.
- Rebase OPSDWorker directly on AsyncActorRolloutRefWorker; delete the
  now-unreferenced SelfDistillWorker base class and its update_sft path.
- update_teacher: switch from make_nd_compute_dataproto_dispatch_fn to
  Dispatch.ONE_TO_ALL so it fires once per FSDP rank with no dummy data.
- _opsd_update: pad train batches to actual DP world
  (total_gpus / (TP * Ulysses-SP)), not total GPUs, to stop the
  TP*SP-fold over-padding under the current TP=2 SP=4 config.
- Drop _compute_val_loss / _build_val_data and the sd_val_files plumbing.
  Generation-based _validate over data.val_files + a reward fn is now
  the sole val signal.
- _validate: drop the spurious sleep_replicas() that left rollout
  asleep when the function returned, breaking the next iter's gen.
- AgentLoopManager.create: pass actor_rollout_resource_pool to match
  the verl PPO trainer's call site.
- main_opsd: build val_reward_fn by directly instantiating the legacy
  verl.workers.reward_manager.naive.NaiveRewardManager
  (the one verl 0.7.x's load_reward_manager returns is the experimental
  async-only manager and is not __call__-able, which crashed
  _validate at the first val step).

Verifier (sd_verifier.py)
- _tokenize_sequence now refuses any pair whose teacher OR student
  sequence would exceed max_length instead of silently truncating; the
  truncation case misaligned teacher/student response logits per-sample
  for the whole batch tail. Defensive assert in
  _opsd_training_step catches any future violation.
- Skip-counter warning surfaces the over-length skip ratio + max_length
  so misconfigurations are loud.
- Add _log_first_opsd_pair for a single readable INFO-level preview of
  the first (teacher, student, response) triple per batch.

Yaml (config/opsd_trainer.yaml)
- Drop the dead top-level custom_reward_function block; reward fn lives
  under the inherited reward.custom_reward_function and is overridden
  there from train_opsd.sh.
- Drop sd_val_files / opsd.val_max_samples (val-loss machinery gone).

Scripts (workspace/scripts/sft)
- setup_sft.sh: stop uninstalling and re-installing verl from a local
  vendored tree; the image ships the matching verl 0.7.0.7+deec5d0.cu130
  build. Just pip-install the two extras (math-verify, tensordict).
- train_opsd.sh: drop SD_VAL_PROMPTS_PATH, the VERL_ROOT PYTHONPATH
  prefix, and the broken top-level custom_reward_function.path override
  (it never reached load_reward_manager). Wire the dual-path math_verify
  scorer via reward.custom_reward_function.path + .name. Default
  REWARD_FN_PATH to workspace/src/rewards/dual_path_math_verify.py.
  Add TOTAL_TRAINING_STEPS hard cap. Tighten with -eo pipefail.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Migrate OPSD to VERL 0.7.x hybrid rollouts: use the image's verl stack and install extras, tighten train script and YAML reward config, change OPSDWorker to AsyncActorRolloutRefWorker, refactor OPSDTrainer to checkpoint-driven replica sync, and switch validation to generation/reward-based flow with stricter tokenization.

Changes

OPSD VERL 0.7.x Migration

Layer / File(s) Summary
Environment Setup and Configuration
workspace/scripts/sft/setup_sft.sh, workspace/scripts/sft/train_opsd.sh, workspace/src/self_distill_hybrid/config/opsd_trainer.yaml
Use container-provided verl stack and install only math-verify/tensordict. train_opsd.sh enables set -xeo pipefail, adjusts PYTHONPATH to ${SD_SRC}, adds REWARD_FN_PATH and TOTAL_TRAINING_STEPS, removes val-file CLI injection, and wires reward CLI keys to reward.custom_reward_function.*. YAML migrates reward defaults to reward@reward: reward and enables legacy_reward_impl.
Trainer Hybrid Infrastructure Integration
workspace/src/self_distill_hybrid/opsd_trainer.py
Integrates CheckpointEngineManager and AgentLoopManager.create(...), computes self.dp_world for DP-axis padding, removes val_data_path usage, wakes and pushes weights to replicas before generation, sleeps replicas after generation to free memory, and forces resync after teacher updates; _update_teacher_weights() simplified.
Validation Flow Refactoring
workspace/src/self_distill_hybrid/main_opsd.py
Removes SD-val parquet auto-detection and load_reward_manager use; when config.data.val_files is set, builds a NaiveRewardManager using get_custom_reward_fn (routed by reward_fn_key, num_examine=0); trainer receives val_reward_fn and val_dataset (no val_data_path).
Worker Architecture Refactoring
workspace/src/self_distill_hybrid/opsd_worker.py
OPSDWorker now extends AsyncActorRolloutRefWorker; registration imports updated; update_teacher becomes a Dispatch.ONE_TO_ALL control method with no DataProto argument; shard-level param copy replaces FSDP full-parameter materialization and adds runtime shape checks.
Batch Tokenization and Logits/Entropy Alignment
workspace/src/self_distill_hybrid/sd_verifier.py, workspace/src/self_distill_hybrid/opsd_worker.py
sd_verifier.py rejects samples exceeding max_length, adds _log_first_opsd_pair logging, and expands skipped-sample warnings. opsd_worker.py asserts teacher/student response-token counts match (skips zero-response cases) and weights entropy accumulation by actual response-token count (n_resp) instead of truncation-based min_len.
JSD / Entropy Comments and Chunking
workspace/src/self_distill_hybrid/opsd_worker.py
Clarify chunking memory comments, differentiability of chunk-wise .float(), log-space mixture formulation for JSD, and log-space entropy computation comments.
DAPO Preprocessing Simplification
workspace/src/data/process_eval_data.py
process_dapo now returns a single training datasets.Dataset (no train/val split), removes writing val_dapo.parquet, and updates printed summaries to omit DAPO validation counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit tunes the rollout stream with care,
Swapping truncation for truths laid bare,
Checkpoints hum as replicas wake and sleep,
Teachers copy weights while students learn deep,
Hop, VERL, hop—let hybrid training leap! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and specifically describes the main objective of the changeset: rebasing the self_distill_hybrid module on the verl 0.7.0.7 API.
Docstring Coverage ✅ Passed Docstring coverage is 85.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hejian/Test_M

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Run on LVA1 with TP=2 SP=4 N_GPUS=8 crashed in `_opsd_update` with:

    AssertionError: only support equal chunk.
        Got size of DataProto 15 and chunk 2.

after the verifier correctly dropped 1 over-length sample (16 → 15).

The padding target was wrong: I had `dp_world = total_gpus / (TP * SP)`
which gives 1 for the failing config, so the `n_samples % dp_world != 0`
gate never fired and the un-padded 15-row batch reached the dispatcher.
But the dispatcher's `make_nd_compute_dataproto_dispatch_fn(mesh_name="actor")`
slices along the actor mesh's DP axis, and the actor mesh has dims
`(dp, ulysses_sp)` only — TP is rollout-only (sglang's
`tensor_model_parallel_size`) and does NOT collapse the actor's DP axis.

Correct formula: `dp_world = total_gpus / ulysses_sp`. For the failing
config that's 8/4 = 2, which matches the `chunks=2` the dispatcher
demanded; 15 then pads to 16 and the dispatcher splits cleanly into
two 8-row chunks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
workspace/src/self_distill_hybrid/opsd_worker.py (1)

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast if actor/ref parameter layouts diverge.

zip() truncates silently. If the new worker wiring ever produces different parameter sets or ordering between actor and ref, this will copy only a prefix and still report success, leaving the teacher partially stale.

🐛 Suggested change
-                for p_student, p_teacher in zip(
-                    self.actor_module_fsdp.parameters(),
-                    self.ref_module_fsdp.parameters(),
-                ):
-                    p_teacher.data.copy_(p_student.data)
+                actor_params = dict(self.actor_module_fsdp.named_parameters())
+                ref_params = dict(self.ref_module_fsdp.named_parameters())
+                if actor_params.keys() != ref_params.keys():
+                    raise RuntimeError(
+                        "Actor/ref parameter sets diverged; refusing partial teacher sync"
+                    )
+                for name, p_student in actor_params.items():
+                    ref_params[name].data.copy_(p_student.data)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 79 - 85, The
current loop uses zip over self.actor_module_fsdp.parameters() and
self.ref_module_fsdp.parameters(), which silently truncates if the parameter
sequences differ; instead, inside the FSDP.summon_full_params blocks for
actor_module_fsdp and ref_module_fsdp, materialize both parameter lists (e.g.,
list(self.actor_module_fsdp.parameters()) and
list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally
validate matching shapes/names) and raise a RuntimeError with clear details if
they diverge, and only then proceed to copy each p_student.data to
p_teacher.data using the matched lists (references: actor_module_fsdp,
ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).
workspace/src/self_distill_hybrid/opsd_trainer.py (1)

580-595: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep val_before_train on its own step.

self.global_steps is incremented before the optional pre-train validation runs, so val_before_train gets logged at the same step as the first optimization step. That makes the charts and step-aligned comparisons ambiguous. Log the warm-up validation at step 0, or move the increment to after each train batch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 580 - 595,
The pre-training validation is logged at the wrong step because
self.global_steps is incremented before the optional val_before_train check;
move the increment so that the validation (triggered by val_before_train and
calling self._validate()) is logged at step 0: call
self.checkpoint_manager.update_weights(), run the optional validation and
logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is
still 0, and only increment self.global_steps (self.global_steps += 1) after
validation (or after the first training batch) so the warm-up validation is
recorded on its own step.
🧹 Nitpick comments (2)
workspace/scripts/sft/setup_sft.sh (1)

17-17: ⚡ Quick win

Constrain the extra installs to the validated stack.

This unconstrained install can still upgrade shared transitive packages in the image, so identical runs may pick up different math-verify/tensordict behavior over time. Please install these through the repo’s pinned constraints or explicit versions instead of floating to latest.

♻️ Suggested change
-pip install math-verify tensordict
+python3 -m pip install \
+  -c "${REPO_ROOT}/workspace/requirements.txt" \
+  math-verify \
+  tensordict
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/setup_sft.sh` at line 17, Replace the floating pip
install in setup_sft.sh (the line "pip install math-verify tensordict") with an
installation that uses the repository's pinned constraints or explicit versions;
e.g. use "pip install -c <constraints-file> math-verify tensordict" or pin
specific versions like "math-verify==<version> tensordict==<version>" so
transitive dependency upgrades are prevented and installs are reproducible.
workspace/scripts/sft/train_opsd.sh (1)

47-47: ⚡ Quick win

Avoid adding an empty PYTHONPATH entry.

If PYTHONPATH is empty here, this expands to ${SD_SRC}:, which adds the current working directory to Python’s import path. That makes module resolution depend on where the script is launched from.

♻️ Suggested change
-export PYTHONPATH="${SD_SRC}:${PYTHONPATH}"
+export PYTHONPATH="${SD_SRC}${PYTHONPATH:+:${PYTHONPATH}}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/train_opsd.sh` at line 47, The export currently
unconditionally appends PYTHONPATH which can produce an empty trailing entry;
update the export of PYTHONPATH (the line referencing SD_SRC and PYTHONPATH) so
that you prepend SD_SRC but only append the existing PYTHONPATH when it is
non-empty — do this by using a shell conditional or parameter expansion that
adds the colon and PYTHONPATH only if PYTHONPATH is set, thereby avoiding adding
an empty entry that would implicitly include the current working directory.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 97-109: The dp_world computation in _opsd_update currently uses
trainer-wide GPU counts (total_gpus) which can be larger than the role-specific
actor pool; after actor_rollout_wg is created, recompute dp_world from that
group's actual world size (e.g., use actor_rollout_wg.size or
actor_rollout_wg.world_size) and divide by (tensor_model_parallel_size *
ulysses_sequence_parallel_size) instead of using
self.config.trainer.n_gpus_per_node * nnodes; update the code in _opsd_update to
set self.dp_world based on the actor_rollout_wg world size and keep the same
denom logic (tp and sp) so padding/divisor uses the real actor worker group
size.

In `@workspace/src/self_distill_hybrid/sd_verifier.py`:
- Around line 335-340: The current logger.info call in sd_verifier.py is
emitting raw prompt and response content (logger.info(..., max_length, t_msgs,
s_msgs, len(response_text), response_preview)), which is a privacy/compliance
risk and too noisy; change this to logger.debug and sanitize/truncate the
payloads before logging (e.g., replace or mask sensitive tokens and limit
t_msgs, s_msgs, and response_preview to a fixed small length like 200 chars or a
summarised count), or alternatively log only metadata (max_length and response
length) at INFO and the sanitized/truncated content at DEBUG; update the call
site that references logger.info, max_length, t_msgs, s_msgs, response_text, and
response_preview accordingly.
- Around line 384-386: In build_opsd_batch(), the loop uses zip(teacher_prompts,
student_prompts, responses) which silently truncates mismatched lengths; add an
explicit length check comparing len(teacher_prompts), len(student_prompts), and
len(responses) (or use zip(..., strict=True) since we target Python 3.10+) and
raise a clear ValueError including the three lengths if they differ, so the
iteration cannot proceed with misaligned inputs; update the for-loop (identified
by the tuple unpacking for idx, (t_prompt, s_prompt, response_text) in
enumerate(...)) to rely on the guarded lengths or strict zip.

---

Outside diff comments:
In `@workspace/src/self_distill_hybrid/opsd_trainer.py`:
- Around line 580-595: The pre-training validation is logged at the wrong step
because self.global_steps is incremented before the optional val_before_train
check; move the increment so that the validation (triggered by val_before_train
and calling self._validate()) is logged at step 0: call
self.checkpoint_manager.update_weights(), run the optional validation and
logger.log(data=val_metrics, step=self.global_steps) while self.global_steps is
still 0, and only increment self.global_steps (self.global_steps += 1) after
validation (or after the first training batch) so the warm-up validation is
recorded on its own step.

In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 79-85: The current loop uses zip over
self.actor_module_fsdp.parameters() and self.ref_module_fsdp.parameters(), which
silently truncates if the parameter sequences differ; instead, inside the
FSDP.summon_full_params blocks for actor_module_fsdp and ref_module_fsdp,
materialize both parameter lists (e.g.,
list(self.actor_module_fsdp.parameters()) and
list(self.ref_module_fsdp.parameters())), compare their lengths (and optionally
validate matching shapes/names) and raise a RuntimeError with clear details if
they diverge, and only then proceed to copy each p_student.data to
p_teacher.data using the matched lists (references: actor_module_fsdp,
ref_module_fsdp, FSDP.summon_full_params, parameters(), p_student, p_teacher).

---

Nitpick comments:
In `@workspace/scripts/sft/setup_sft.sh`:
- Line 17: Replace the floating pip install in setup_sft.sh (the line "pip
install math-verify tensordict") with an installation that uses the repository's
pinned constraints or explicit versions; e.g. use "pip install -c
<constraints-file> math-verify tensordict" or pin specific versions like
"math-verify==<version> tensordict==<version>" so transitive dependency upgrades
are prevented and installs are reproducible.

In `@workspace/scripts/sft/train_opsd.sh`:
- Line 47: The export currently unconditionally appends PYTHONPATH which can
produce an empty trailing entry; update the export of PYTHONPATH (the line
referencing SD_SRC and PYTHONPATH) so that you prepend SD_SRC but only append
the existing PYTHONPATH when it is non-empty — do this by using a shell
conditional or parameter expansion that adds the colon and PYTHONPATH only if
PYTHONPATH is set, thereby avoiding adding an empty entry that would implicitly
include the current working directory.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 80dec8bf-93c0-4d5f-a023-102d957db186

📥 Commits

Reviewing files that changed from the base of the PR and between 310cf27 and 80ad2fa.

📒 Files selected for processing (8)
  • workspace/scripts/sft/setup_sft.sh
  • workspace/scripts/sft/train_opsd.sh
  • workspace/src/self_distill_hybrid/config/opsd_trainer.yaml
  • workspace/src/self_distill_hybrid/main_opsd.py
  • workspace/src/self_distill_hybrid/opsd_trainer.py
  • workspace/src/self_distill_hybrid/opsd_worker.py
  • workspace/src/self_distill_hybrid/sd_verifier.py
  • workspace/src/self_distill_hybrid/sd_worker.py
💤 Files with no reviewable changes (1)
  • workspace/src/self_distill_hybrid/sd_worker.py

Comment on lines +97 to +109
# DP world size: total GPUs / (TP * Ulysses-SP). Used as the divisor
# when padding batches before the nd-compute dispatcher slices along
# the actor mesh's DP axis. Using total GPUs (as we did previously)
# over-pads whenever TP*SP > 1.
total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes
tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1)
sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1)
denom = max(1, int(tp) * int(sp))
self.dp_world = max(1, total_gpus // denom)
py_logger.info(
"DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)",
self.dp_world, total_gpus, tp, sp,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive dp_world from the actor worker group.

Line 101 derives the padding divisor from trainer-wide GPU config, but this trainer now explicitly allocates a role-specific actor_rollout_resource_pool. If that pool is smaller than the full trainer allocation, _opsd_update() will pad to the wrong divisor and the DP sharding path can mispartition or assert. Compute dp_world after self.actor_rollout_wg is created, using its actual world size instead.

Suggested fix
-        total_gpus = self.config.trainer.n_gpus_per_node * self.config.trainer.nnodes
         tp = self.config.actor_rollout_ref.rollout.get("tensor_model_parallel_size", 1)
         sp = self.config.actor_rollout_ref.actor.get("ulysses_sequence_parallel_size", 1)
-        denom = max(1, int(tp) * int(sp))
-        self.dp_world = max(1, total_gpus // denom)
-        py_logger.info(
-            "DP world = %d (total_gpus=%d, TP=%d, ulysses_sp=%d)",
-            self.dp_world, total_gpus, tp, sp,
-        )
+        self._dp_denom = max(1, int(tp) * int(sp))
+        self.dp_world = None
@@
         self.actor_rollout_wg = all_wg[str(Role.ActorRolloutRef)]
         self.actor_rollout_wg.init_model()
+        self.dp_world = max(1, self.actor_rollout_wg.world_size // self._dp_denom)
+        py_logger.info(
+            "DP world = %d (actor_world_size=%d, TP=%d, ulysses_sp=%d)",
+            self.dp_world,
+            self.actor_rollout_wg.world_size,
+            tp,
+            sp,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_trainer.py` around lines 97 - 109, The
dp_world computation in _opsd_update currently uses trainer-wide GPU counts
(total_gpus) which can be larger than the role-specific actor pool; after
actor_rollout_wg is created, recompute dp_world from that group's actual world
size (e.g., use actor_rollout_wg.size or actor_rollout_wg.world_size) and divide
by (tensor_model_parallel_size * ulysses_sequence_parallel_size) instead of
using self.config.trainer.n_gpus_per_node * nnodes; update the code in
_opsd_update to set self.dp_world based on the actor_rollout_wg world size and
keep the same denom logic (tp and sp) so padding/divisor uses the real actor
worker group size.

Comment on lines +335 to +340
logger.info(
"OPSD batch[0] sample preview (max_length=%d):\n"
" teacher_prompt (sd_prompt) messages: %s\n"
" student_prompt (sft_prompt) messages: %s\n"
" response (%d chars): %s",
max_length, t_msgs, s_msgs, len(response_text), response_preview,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid logging raw prompt/response content at INFO level.

This emits decoded prompts and response text into routine training logs, which is a privacy/compliance risk and can create noisy, high-volume logs. Gate this behind DEBUG and truncate/sanitize prompt payloads too.

Suggested hardening
 def _log_first_opsd_pair(
     t_prompt: str, s_prompt: str, response_text: str, max_length: int
 ) -> None:
@@
-    response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "")
-    logger.info(
+    t_preview = str(t_msgs)
+    s_preview = str(s_msgs)
+    t_preview = t_preview[:500] + ("…" if len(t_preview) > 500 else "")
+    s_preview = s_preview[:500] + ("…" if len(s_preview) > 500 else "")
+    response_preview = response_text[:500] + ("…" if len(response_text) > 500 else "")
+    logger.debug(
         "OPSD batch[0] sample preview (max_length=%d):\n"
         "  teacher_prompt (sd_prompt) messages: %s\n"
         "  student_prompt (sft_prompt) messages: %s\n"
         "  response (%d chars): %s",
-        max_length, t_msgs, s_msgs, len(response_text), response_preview,
+        max_length, t_preview, s_preview, len(response_text), response_preview,
     )
-    log_first_pair = logger.isEnabledFor(logging.INFO)
+    log_first_pair = logger.isEnabledFor(logging.DEBUG)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 335 - 340, The
current logger.info call in sd_verifier.py is emitting raw prompt and response
content (logger.info(..., max_length, t_msgs, s_msgs, len(response_text),
response_preview)), which is a privacy/compliance risk and too noisy; change
this to logger.debug and sanitize/truncate the payloads before logging (e.g.,
replace or mask sensitive tokens and limit t_msgs, s_msgs, and response_preview
to a fixed small length like 200 chars or a summarised count), or alternatively
log only metadata (max_length and response length) at INFO and the
sanitized/truncated content at DEBUG; update the call site that references
logger.info, max_length, t_msgs, s_msgs, response_text, and response_preview
accordingly.

Comment on lines +384 to +386
for idx, (t_prompt, s_prompt, response_text) in enumerate(
zip(teacher_prompts, student_prompts, responses)
):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the target file and inspect around the referenced lines
ls -la workspace/src/self_distill_hybrid || true
sed -n '330,430p' workspace/src/self_distill_hybrid/sd_verifier.py

# Find all occurrences of the zip call (to see if strict=True or other guards exist elsewhere)
rg -n "zip\(teacher_prompts, student_prompts, responses\)" workspace/src/self_distill_hybrid/sd_verifier.py

# Look for any function that builds/assigns teacher_prompts/student_prompts/responses
rg -n "teacher_prompts|student_prompts|responses" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 5270


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all call sites of build_opsd_batch
rg -n "build_opsd_batch\(" -S . | head -n 200

# Inspect sd_verifier.py around the function definition (broader context)
sed -n '320,430p' workspace/src/self_distill_hybrid/sd_verifier.py

# Inspect potential surrounding code that prepares these lists in sd_verifier.py
# (look for code that defines/assigns teacher_prompts/student_prompts/responses near where build_opsd_batch is called)
rg -n "teacher_prompts\s*=|student_prompts\s*=|responses\s*=" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 200

# For each build_opsd_batch call site, print a small window around it
# (rg output is bounded; we then sed based on file and approximate line)
rg -n "build_opsd_batch\(" -S . | cut -d: -f1,2 | while read -r f l; do
  echo "---- $f:$l ----"
  sed -n "$((l-20)),$((l+60))p" "$f"
done

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4155


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '560,720p' workspace/src/self_distill_hybrid/opsd_trainer.py
sed -n '760,880p' workspace/src/self_distill_hybrid/opsd_trainer.py

# Also locate the exact call to build_opsd_batch and print a tight window
rg -n "opsd_batch\s*=\s*build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py
rg -n "build_opsd_batch\(" workspace/src/self_distill_hybrid/opsd_trainer.py | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 13185


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where non_tensor_batch keys "sd_prompt" and "sft_prompt" are set/produced
rg -n "\"sd_prompt\"|\"sft_prompt\"" workspace/src/self_distill_hybrid | head -n 200

# Inspect sd_dataset.py for building those fields
sed -n '1,260p' workspace/src/self_distill_hybrid/sd_dataset.py
sed -n '260,520p' workspace/src/self_distill_hybrid/sd_dataset.py

# Check for any DataProto/collate logic that enforces list length consistency
rg -n "non_tensor_batch" workspace/src/self_distill_hybrid | head -n 200

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 11309


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for Python version spec in common project files
ls -la
find . -maxdepth 3 -type f \( -iname "pyproject.toml" -o -iname "setup.cfg" -o -iname "setup.py" -o -iname "requirements*.txt" -o -iname "Pipfile" -o -iname "environment*.yml" -o -iname ".python-version" \) | sed 's|^\./||'

# Print relevant excerpts if present
for f in pyproject.toml setup.cfg setup.py .python-version requirements.txt requirements-dev.txt environment.yml environment.yaml Pipfile; do
  if [ -f "$f" ]; then
    echo "---- $f ----"
    sed -n '1,200p' "$f"
  fi
done

# If pyproject exists, extract the python version constraint lines
if [ -f "pyproject.toml" ]; then
  rg -n "python\s*=" pyproject.toml || true
  rg -n "requires-python" pyproject.toml || true
  rg -n "Programming Language :: Python" -S pyproject.toml setup.cfg setup.py || true
fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 647


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,200p' workspace/requirements.txt
sed -n '1,200p' README.md

# Any explicit python version files
find . -maxdepth 2 -type f -iname "*python*" -o -iname ".python-version" -o -iname "*runtime*" | sed 's|^\./||'
if [ -f .python-version ]; then cat .python-version; fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 8163


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "zip\([^)]*,\s*strict\s*=\s*True" -S workspace/src/self_distill_hybrid || true
rg -n "zip\(" workspace/src/self_distill_hybrid/sd_verifier.py | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 268


Do not let zip silently drop mismatched inputs.

build_opsd_batch() iterates with zip(teacher_prompts, student_prompts, responses); if any upstream contract breaks and the lists differ in length, zip truncates silently and can misalign prompts and responses. Add a length guard (and/or strict=True; repo targets Python 3.10+).

Suggested fix
+    if not (len(teacher_prompts) == len(student_prompts) == len(responses)):
+        raise ValueError(
+            "build_opsd_batch expects equal lengths for "
+            "teacher_prompts, student_prompts, and responses"
+        )
     for idx, (t_prompt, s_prompt, response_text) in enumerate(
-        zip(teacher_prompts, student_prompts, responses)
+        zip(teacher_prompts, student_prompts, responses, strict=True)
     ):
🧰 Tools
🪛 Ruff (0.15.14)

[warning] 385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/sd_verifier.py` around lines 384 - 386, In
build_opsd_batch(), the loop uses zip(teacher_prompts, student_prompts,
responses) which silently truncates mismatched lengths; add an explicit length
check comparing len(teacher_prompts), len(student_prompts), and len(responses)
(or use zip(..., strict=True) since we target Python 3.10+) and raise a clear
ValueError including the three lengths if they differ, so the iteration cannot
proceed with misaligned inputs; update the for-loop (identified by the tuple
unpacking for idx, (t_prompt, s_prompt, response_text) in enumerate(...)) to
rely on the guarded lengths or strict zip.

Two small follow-ups uncovered while watching the LVA1 quick-test run
of the verl 0.7.0.7 rebase.

sd_verifier._log_first_opsd_pair: emit at WARNING and drop the
  logger.isEnabledFor(logging.INFO) gate.

  Hydra raises the __main__ logger to INFO but leaves the root logger
  at the WARNING default, so child loggers like
  self_distill_hybrid.sd_verifier silently swallow INFO calls when
  running inside the Ray actor. Same logger's WARNING messages (e.g.
  the over-length skip report) DO surface, which is what tipped me
  off — the preview was supposed to fire at idx==0 of every
  build_opsd_batch call but never showed up in the cluster log. Use
  logger.warning and drop the gate so it always emits exactly once
  per training step.

process_eval_data.process_dapo: stop producing val_dapo.parquet.

  The downstream training pipeline only validates on AIME24, AIME25,
  and MATH-500 (the auto-detect in train_opsd.sh never picked up
  val_dapo). The previous 5% train/val split was therefore burning
  870 examples that no consumer ever read. Return the full DAPO set
  as the training split; skip the val output and the no-longer-
  reachable summary lines. Module docstring, section banner, and
  process_dapo docstring updated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/data/process_eval_data.py`:
- Around line 40-45: The README still describes a 95/5 DAPO preprocessing split
and production of val_dapo.parquet but the code in process_dapo now keeps all
DAPO examples in a single training split; update the README.md documentation to
reflect this new behavior by replacing the 95/5 split description and any
references to val_dapo.parquet with text that states DAPO is processed as a
single training split (no validation output produced) and adjust any example
commands/outputs that mention val_dapo.parquet accordingly; verify and update
the README sections that previously referenced the split and the val_dapo output
so they match the process_dapo implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b64964a0-9a48-43a1-a164-8692647becc7

📥 Commits

Reviewing files that changed from the base of the PR and between 0a3867a and 99071c9.

📒 Files selected for processing (2)
  • workspace/src/data/process_eval_data.py
  • workspace/src/self_distill_hybrid/sd_verifier.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • workspace/src/self_distill_hybrid/sd_verifier.py

Comment on lines +40 to +45
def process_dapo(data_path):
"""Process DAPO-Math-17k-dedup as a single training split.

The training pipeline validates on AIME24/AIME25/MATH-500 only, so the
previous train/val split was producing a val_dapo.parquet that no
downstream code consumed. Keep all examples in training instead.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update README preprocessing contract to match the new single-split behavior.

This change keeps all DAPO records in training, but README.md still documents a 95/5 DAPO split and val_dapo.parquet output (README.md Line 89 and Line 99-102). Please update docs in this PR to prevent operator confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/data/process_eval_data.py` around lines 40 - 45, The README
still describes a 95/5 DAPO preprocessing split and production of
val_dapo.parquet but the code in process_dapo now keeps all DAPO examples in a
single training split; update the README.md documentation to reflect this new
behavior by replacing the 95/5 split description and any references to
val_dapo.parquet with text that states DAPO is processed as a single training
split (no validation output produced) and adjust any example commands/outputs
that mention val_dapo.parquet accordingly; verify and update the README sections
that previously referenced the split and the val_dapo output so they match the
process_dapo implementation.

Hejian Sang and others added 2 commits May 30, 2026 07:15
The previous update_teacher wrapped the param copy in nested
FSDP.summon_full_params(...), which materialises the FULL unsharded
parameters of BOTH actor and ref on every rank for the duration of the
copy. For Qwen3-14B that's ~30 GiB x 2 modules = ~60 GiB on top of the
sharded params already on-GPU. When teacher_update_freq > 0 fires right
after a training step has loaded the actor for fwd/bwd, this puts the
H100 80 GB rank over the line and the rank OOMs with:

    ray::WorkerDict.actor_rollout_ref_update_teacher()
    torch.OutOfMemoryError: Tried to allocate 5.80 GiB.
    GPU 0 has a total capacity of 79.19 GiB of which 2.28 GiB is free.

Observed in tu1 (teacher_update_freq=1) right after val_before_train +
first opsd_update on LVA1 (Flyte fcbed452f8d7640709d2 -> retry
f9941301d258e4619921). tu10/20/50/100 would have hit the same wall on
their first teacher copy.

The actor and ref modules are wrapped with the SAME FSDP policy on the
SAME DP mesh during init in verl's ActorRolloutRefWorker, so their local
parameter shards correspond element-wise. Iterating .parameters() on
both in lock-step and copying p_student.data into p_teacher.data does
the right thing without any unsharded materialisation. Add a shape-equal
assert to make any future wrap-policy drift loud.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qwen3-14B + Ulysses-SP=4 on H100 80 GB peaks at ~72-74 GiB during the
reverse-KL / JSD path inside _opsd_training_step, leaving little to no
slack on top of sglang's ~2.87 GiB rollout footprint. With the default
best-fit allocator we hit:

    torch.kl_div(input, target, ...)
    torch.OutOfMemoryError: Tried to allocate 150.00 MiB.
    GPU 0 has total 79.19 GiB of which 130.69 MiB is free.
    PyTorch allocated 72.16 GiB, 120 MiB reserved-unallocated.

(observed on the LVA1 quick test fcbed452f8d7640709d2 step 3 and the
tu20 retry fa063558d3d234157b4c step 1). 120 MiB reserved-unallocated
+ < 150 MiB free + 150 MiB requested is the textbook fragmentation
signature; expandable_segments coalesces those holes and almost always
gives a few GB back. PyTorch's own OOM message points at this setting
as the first thing to try.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
workspace/src/self_distill_hybrid/opsd_worker.py (1)

295-301: ⚡ Quick win

Validate response-token counts per sample, not only in aggregate.

This assert only compares flattened totals. A malformed batch can still pass with equal totals while pairing the wrong teacher/student logits across examples. Since the contract is per-pair alignment, check the shifted loss_mask counts row-by-row before flattening.

Proposed refactor
+            t_resp_counts = t_loss_mask[:, 1:].sum(dim=1)
+            s_resp_counts = s_loss_mask[:, 1:].sum(dim=1)
+            if not torch.equal(t_resp_counts, s_resp_counts):
+                raise RuntimeError(
+                    "teacher/student response-token counts differ per sample: "
+                    f"teacher={t_resp_counts.tolist()} student={s_resp_counts.tolist()}"
+                )
             assert teacher_logits.shape[0] == student_logits.shape[0], (
                 f"teacher/student response-token count mismatch: "
                 f"{teacher_logits.shape[0]} vs {student_logits.shape[0]}. "
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 295 - 301, The
current assert only checks flattened totals (teacher_logits.shape[0] vs
student_logits.shape[0]) which can hide per-example misalignment; compute
per-sample response-token counts from the shifted loss_mask (e.g., resp_counts =
(loss_mask[:, 1:] != 0).sum(dim=1)) and use those counts to (a) assert
resp_counts.sum() equals teacher_logits.shape[0] and student_logits.shape[0],
and (b) verify that splitting/offsetting teacher_logits and student_logits into
per-sample chunks using resp_counts yields matching lengths for each pair;
update the code around teacher_logits/student_logits handling in opsd_worker.py
(and ensure this invariant complements upstream filter_overlong_prompts and
build_opsd_batch).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 94-105: The loop in update_teacher uses
zip(self.actor_module_fsdp.parameters(), self.ref_module_fsdp.parameters())
which silently truncates if the two parameter iterators differ; change the loop
to use zip(..., strict=True) so it raises immediately on length mismatch,
preserving the existing per-parameter shape check and RuntimeError behavior
(ensure the runtime environment targets Python 3.10+ as noted in README.md).

---

Nitpick comments:
In `@workspace/src/self_distill_hybrid/opsd_worker.py`:
- Around line 295-301: The current assert only checks flattened totals
(teacher_logits.shape[0] vs student_logits.shape[0]) which can hide per-example
misalignment; compute per-sample response-token counts from the shifted
loss_mask (e.g., resp_counts = (loss_mask[:, 1:] != 0).sum(dim=1)) and use those
counts to (a) assert resp_counts.sum() equals teacher_logits.shape[0] and
student_logits.shape[0], and (b) verify that splitting/offsetting teacher_logits
and student_logits into per-sample chunks using resp_counts yields matching
lengths for each pair; update the code around teacher_logits/student_logits
handling in opsd_worker.py (and ensure this invariant complements upstream
filter_overlong_prompts and build_opsd_batch).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29287c03-b724-4118-9970-1cb73a5f78aa

📥 Commits

Reviewing files that changed from the base of the PR and between 99071c9 and 4d50e6a.

📒 Files selected for processing (1)
  • workspace/src/self_distill_hybrid/opsd_worker.py

Comment on lines +94 to +105
for p_student, p_teacher in zip(
self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters(),
):
if p_student.shape != p_teacher.shape:
raise RuntimeError(
f"update_teacher shard mismatch at param {n_params}: "
f"actor shard {tuple(p_student.shape)} vs ref shard "
f"{tuple(p_teacher.shape)}. Both FSDP modules must "
"be wrapped with the same policy on the same mesh."
)
p_teacher.data.copy_(p_student.data)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import sys
print(f"sandbox_python={sys.version.split()[0]}")
PY

fd -HI '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' | while read -r f; do
  printf '\n== %s ==\n' "$f"
  rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:' "$f" || true
done

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 100


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate likely Python version declaration files
files=$(fd -t f '^(pyproject\.toml|setup\.cfg|setup\.py|tox\.ini|\.python-version|Dockerfile.*)$' . || true)
echo "Found python-config files:"
echo "$files" | sed 's|^| - |'

# Extract python version constraints from those files (and a few common extras)
echo
echo "Python version constraints (if present):"
for f in $files; do
  echo
  echo "== $f =="
  sed -n '1,200p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM .*python:|^FROM python:|target-version|language-version|python_version' || true
  sed -n '1,200p' "$f" | rg -n '\[tool\.poetry\.dependencies\]|\[project\]|\[tool\.poetry\]' || true
done

# Confirm context for the teacher sync code path
echo
echo "Context around opsd_worker.py teacher update parameters copy loop:"
fd -t f 'opsd_worker\.py$' workspace/src/self_distill_hybrid/opsd_worker.py >/dev/null 2>&1 && true
python_file="workspace/src/self_distill_hybrid/opsd_worker.py"
if [ -f "$python_file" ]; then
  rg -n 'update_teacher|teacher shard mismatch|actor_module_fsdp|ref_module_fsdp|p_teacher\.data\.copy_' "$python_file" -n -S
  echo
  # show surrounding lines (avoid dumping too much)
  sed -n '60,140p' "$python_file" | nl -ba | rg -n 'for p_student, p_teacher|teacher shard mismatch|copy_\(p_student|parameters\(' || true
else
  echo "Missing: $python_file"
fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 2746


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for common Python version constraint files
candidates=$(
  (fd -t f 'pyproject\.toml$' . || true)
  (fd -t f 'setup\.cfg$' . || true)
  (fd -t f 'setup\.py$' . || true)
  (fd -t f 'tox\.ini$' . || true)
  (fd -t f '\.python-version$' . || true)
  (fd -t f '^Dockerfile' . || true)
)

echo "Constraint file candidates:"
if [ -z "${candidates// }" ]; then
  echo "  (none found)"
else
  echo "$candidates" | sed 's/^/ - /'
fi

# Extract Python version constraints from any found candidates
if [ -n "${candidates// }" ]; then
  echo
  echo "Extracted Python version constraints:"
  echo "$candidates" | while read -r f; do
    [ -f "$f" ] || continue
    echo
    echo "== $f =="
    sed -n '1,220p' "$f" | rg -n 'requires-python|python_requires|PYTHON_VERSION|FROM python:|target-version|language-version|python_version' || true
  done
fi

# Also just print the signature of zip for sanity (sandbox runtime)
python - <<'PY'
import sys, inspect
print("sandbox python:", sys.version)
print("zip signature:", inspect.signature(zip))
PY

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 824


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# List likely CI/workflow files
echo "Root files (top-level):"
ls -la | head -n 60

echo
echo "Searching for Python version declarations in repo files..."
# Common CI markers
rg -n --hidden --glob '!.git/*' \
  'actions/setup-python|python-version:|setup-python|PYTHON_VERSION|requires-python|python_requires|FROM python:|language: python|target-version' \
  . || true

echo
echo "Searching for any usage of zip(strict=...) in the repo..."
rg -n --hidden --glob '!.git/*' 'zip\([^)]*strict\s*=' . || true

echo
echo "Searching for docs mentioning supported Python versions..."
rg -n --hidden --glob '!.git/*' \
  'Python (3\.[0-9]+|>=|minimum|supported|compatibility)|python 3\.[0-9]+' \
  README* docs* .github* . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 1005


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python_file="workspace/src/self_distill_hybrid/opsd_worker.py"
sed -n '80,130p' "$python_file" | nl -ba | sed -n '1,120p'

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 121


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python_file="workspace/src/self_distill_hybrid/opsd_worker.py"

echo "=== Exact zip loop block ==="
# Print a tighter window around the zip() loop
rg -n "for p_student, p_teacher in zip\(" "$python_file"
start=$(rg -n "for p_student, p_teacher in zip\(" "$python_file" | head -n1 | cut -d: -f1)
echo "Start line: $start"
lo=$((start-5))
hi=$((start+30))
sed -n "${lo},${hi}p" "$python_file" | cat -n

echo
echo "=== Context lines 70-120 ==="
sed -n '70,120p' "$python_file" | cat -n

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4927


Fail fast when FSDP parameter iterators diverge in update_teacher. zip() truncates to the shorter iterator; if actor_module_fsdp.parameters() and ref_module_fsdp.parameters() produce different lengths, trailing teacher shards won’t be copied (yet the update still logs success). Use zip(..., strict=True) (Python 3.10+ per README.md).

Proposed fix
             for p_student, p_teacher in zip(
                 self.actor_module_fsdp.parameters(),
                 self.ref_module_fsdp.parameters(),
+                strict=True,
             ):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for p_student, p_teacher in zip(
self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters(),
):
if p_student.shape != p_teacher.shape:
raise RuntimeError(
f"update_teacher shard mismatch at param {n_params}: "
f"actor shard {tuple(p_student.shape)} vs ref shard "
f"{tuple(p_teacher.shape)}. Both FSDP modules must "
"be wrapped with the same policy on the same mesh."
)
p_teacher.data.copy_(p_student.data)
for p_student, p_teacher in zip(
self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters(),
strict=True,
):
if p_student.shape != p_teacher.shape:
raise RuntimeError(
f"update_teacher shard mismatch at param {n_params}: "
f"actor shard {tuple(p_student.shape)} vs ref shard "
f"{tuple(p_teacher.shape)}. Both FSDP modules must "
"be wrapped with the same policy on the same mesh."
)
p_teacher.data.copy_(p_student.data)
🧰 Tools
🪛 Ruff (0.15.14)

[warning] 94-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/src/self_distill_hybrid/opsd_worker.py` around lines 94 - 105, The
loop in update_teacher uses zip(self.actor_module_fsdp.parameters(),
self.ref_module_fsdp.parameters()) which silently truncates if the two parameter
iterators differ; change the loop to use zip(..., strict=True) so it raises
immediately on length mismatch, preserving the existing per-parameter shape
check and RuntimeError behavior (ensure the runtime environment targets Python
3.10+ as noted in README.md).

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
workspace/scripts/sft/train_opsd.sh (1)

36-42: ⚡ Quick win

Don't clobber caller-provided allocator tuning.

This overwrites any existing PYTORCH_CUDA_ALLOC_CONF, so operators can't combine expandable_segments with other allocator knobs they already rely on.

Suggested change
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+if [[ -n "${PYTORCH_CUDA_ALLOC_CONF:-}" ]]; then
+    case ",${PYTORCH_CUDA_ALLOC_CONF}," in
+        *,expandable_segments:True,*|*,expandable_segments:true,*) ;;
+        *) export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF},expandable_segments:True" ;;
+    esac
+else
+    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/train_opsd.sh` around lines 36 - 42, The script
currently overwrites PYTORCH_CUDA_ALLOC_CONF which destroys caller-provided
tuning; update the export in train_opsd.sh to preserve existing settings by
checking if PYTORCH_CUDA_ALLOC_CONF is set and, if so, append
",expandable_segments:True" (avoiding duplicates) otherwise set it to
"expandable_segments:True", and ensure the code references the
PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so
other allocator knobs remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 58-63: The script always enables MLflow in the HuggingFace Trainer
by setting trainer.logger='["console","mlflow"]' which causes local runs to
create mlruns/; change the logic in workspace/scripts/sft/train_opsd.sh around
the trainer.logger assignment so that "mlflow" is only added when a tracking
backend is configured (e.g. MLFLOW_TRACKING_URI is non-empty or
FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that
detects those env vars and sets trainer.logger to '["console","mlflow"]' when
true and '["console"]' otherwise, referencing the trainer.logger assignment and
the sourced setup_mlflow_hf.sh for placement.

---

Nitpick comments:
In `@workspace/scripts/sft/train_opsd.sh`:
- Around line 36-42: The script currently overwrites PYTORCH_CUDA_ALLOC_CONF
which destroys caller-provided tuning; update the export in train_opsd.sh to
preserve existing settings by checking if PYTORCH_CUDA_ALLOC_CONF is set and, if
so, append ",expandable_segments:True" (avoiding duplicates) otherwise set it to
"expandable_segments:True", and ensure the code references the
PYTORCH_CUDA_ALLOC_CONF environment variable when composing the final export so
other allocator knobs remain intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 498686f6-65e2-4738-afca-1c04762d81b6

📥 Commits

Reviewing files that changed from the base of the PR and between 4d50e6a and 65f827f.

📒 Files selected for processing (1)
  • workspace/scripts/sft/train_opsd.sh

Comment on lines +58 to +63
# Auto-configure MLflow tracking when running on the Flyte/Kingkong cluster.
# Sets MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT from
# Flyte execution env vars so HuggingFace Trainer (and any other downstream
# code that respects MLFLOW_*) logs to go/mlflowui automatically. Becomes a
# no-op if FLYTE_INTERNAL_EXECUTION_PROJECT is unset (local invocations).
source "${SCRIPT_DIR}/setup_mlflow_hf.sh"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In VERL 0.7.x, what happens when trainer.loggerincludes "mlflow" butMLFLOW_TRACKING_URIis unset? Does it no-op, create a localmlruns store, or fail initialization?

💡 Result:

In VERL 0.7.x, when trainer.logger includes "mlflow" but MLFLOW_TRACKING_URI is unset, the behavior is determined by the underlying MLflow Python library's default configuration rather than explicit handling within the VERL framework [1][2]. By default, when MLFLOW_TRACKING_URI is not set, MLflow initializes a local backend, typically creating a directory named mlruns in the current working directory to store tracking data [1]. VERL’s tracking.py module integrates MLflow by importing it and calling mlflow.log_metrics directly within its logging methods [2]. Because VERL does not override the default MLflow tracking URI, the system will fall back to this default local behavior [1][2]. It does not fail initialization due to the missing environment variable; instead, it proceeds to log locally [1][2]. If you wish to use a specific local path or a remote server, you must explicitly set the MLFLOW_TRACKING_URI environment variable before the trainer initializes, as documented in the VERL guidelines [1].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'

echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'

# Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR
# Determine SCRIPT_DIR value? Just search for the file name in repo.
echo "---- find setup_mlflow_hf.sh ----"
fd -a "setup_mlflow_hf.sh" . || true

# Print that script if small enough; otherwise print relevant parts.
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then
  echo "---- file: $SETUP (first 200 lines) ----"
  wc -l "$SETUP"
  nl -ba "$SETUP" | sed -n '1,220p'
fi

# Search for FLYTE_INTERNAL_EXECUTION_PROJECT usage in setup or train script
echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true

# Search for trainer.logger usage in train_opsd.sh
echo "---- rg trainer\.logger ----"
rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'

echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'

# Locate setup_mlflow_hf.sh referenced by SCRIPT_DIR
echo "---- find setup_mlflow_hf.sh ----"
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ]; then
  echo "Found: $SETUP"
  wc -l "$SETUP"
  echo "---- file: $SETUP (first 220 lines) ----"
  nl -ba "$SETUP" | sed -n '1,220p'
else
  echo "setup_mlflow_hf.sh not found"
fi

echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true

echo "---- rg trainer\.logger in train_opsd.sh ----"
rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="workspace/scripts/sft/train_opsd.sh"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "---- head around 50-90 ----"
nl -ba "$FILE" | sed -n '45,90p'

echo "---- head around 260-320 ----"
nl -ba "$FILE" | sed -n '260,320p'

echo "---- find setup_mlflow_hf.sh ----"
SETUP="$(fd -a "setup_mlflow_hf.sh" . | head -n 1 || true)"
if [ -n "${SETUP:-}" ] && [ -f "$SETUP" ]; then
  echo "Found: $SETUP"
  wc -l "$SETUP"
  echo "---- file: $SETUP (first 220 lines) ----"
  nl -ba "$SETUP" | sed -n '1,220p'
else
  echo "setup_mlflow_hf.sh not found"
fi

echo "---- rg FLYTE_INTERNAL_EXECUTION_PROJECT in repo ----"
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true

echo "---- rg trainer\.logger in train_opsd.sh ----"
rg -n "trainer\.logger" "$FILE" || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 168


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="workspace/scripts/sft/train_opsd.sh"
echo "== file exists? =="
ls -la "$FILE" || true

echo "== print lines 45-90 of $FILE (with numbers) =="

python3 - <<'PY'
from pathlib import Path
path = Path("workspace/scripts/sft/train_opsd.sh")
lines = path.read_text().splitlines()
for i in range(44, 90):  # 1-based -> slice index
    if i < len(lines):
        print(f"{i+1}\t{lines[i]}")
PY

echo "== print lines 260-320 of $FILE (with numbers) =="

python3 - <<'PY'
from pathlib import Path
path = Path("workspace/scripts/sft/train_opsd.sh")
lines = path.read_text().splitlines()
for i in range(259, 320):
    if i < len(lines):
        print(f"{i+1}\t{lines[i]}")
PY

echo "== find setup_mlflow_hf.sh =="

SETUP_PATHS="$(find . -type f -name 'setup_mlflow_hf.sh' 2>/dev/null || true)"
echo "$SETUP_PATHS" | head -n 20

SETUP_PATH="$(echo "$SETUP_PATHS" | head -n 1)"
if [ -n "${SETUP_PATH:-}" ]; then
  echo "== setup file line count =="
  python3 - <<PY
from pathlib import Path
p=Path("$SETUP_PATH")
print(sum(1 for _ in p.open('r',encoding='utf-8',errors='ignore')))
PY

  echo "== print lines 1-220 of $SETUP_PATH (with numbers) =="
  python3 - <<'PY'
from pathlib import Path
path = Path("PLACEHOLDER")
PY
fi

if [ -n "${SETUP_PATH:-}" ]; then
  python3 - <<PY
from pathlib import Path
path = Path("$SETUP_PATH")
lines = path.read_text().splitlines()
for i in range(0, min(220, len(lines))):
    print(f"{i+1}\t{lines[i]}")
PY
fi

echo "== search for FLYTE_INTERNAL_EXECUTION_PROJECT =="

if command -v rg >/dev/null 2>&1; then
  rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true
else
  grep -RIn "FLYTE_INTERNAL_EXECUTION_PROJECT" "$FILE" . || true
fi

echo "== search for trainer.logger =="

if command -v rg >/dev/null 2>&1; then
  rg -n "trainer\.logger" "$FILE" || true
else
  grep -n "trainer\.logger" "$FILE" || true
fi

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 4852


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== ls workspace/scripts/sft =="
ls -la workspace/scripts/sft || true

echo "== find setup_mlflow_hf.sh (case-insensitive) =="
find workspace/scripts/sft -maxdepth 1 -iname 'setup_mlflow_hf.sh' -print || true

echo "== find setup_mlflow_hf.sh anywhere (case-insensitive, limited to first 20) =="
find . -type f -iname 'setup_mlflow_hf.sh' 2>/dev/null | head -n 20 || true

# If the expected file exists, print it
if [ -f "workspace/scripts/sft/setup_mlflow_hf.sh" ]; then
  echo "== setup_mlflow_hf.sh exists; line count =="
  python3 - <<'PY'
from pathlib import Path
p=Path("workspace/scripts/sft/setup_mlflow_hf.sh")
print(sum(1 for _ in p.open("r",encoding="utf-8",errors="ignore")))
PY
  echo "== setup_mlflow_hf.sh (first 200 lines with numbers) =="
  python3 - <<'PY'
from pathlib import Path
path=Path("workspace/scripts/sft/setup_mlflow_hf.sh")
lines=path.read_text().splitlines()
for i in range(0, min(200, len(lines))):
    print(f"{i+1}\t{lines[i]}")
PY
fi

echo "== rg for setup_mlflow_hf.sh references =="
rg -n "setup_mlflow_hf\.sh" . || true

echo "== rg for FLYTE_INTERNAL_EXECUTION_PROJECT across repo =="
rg -n "FLYTE_INTERNAL_EXECUTION_PROJECT" . || true

echo "== rg for mlflow logger handling in repo (best-effort) =="
rg -n "mlflow" workspace/scripts/sft . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 1474


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== find any setup_mlflow* scripts =="
find . -type f \( -iname 'setup_mlflow*' -o -iname '*mlflow*setup*' \) | head -n 50

echo "== find mlflow setup scripts by name fragments =="
find . -type f \( -iname '*mlflow*.sh' -o -iname '*mlflow*setup*' \) | head -n 50

echo "== rg for MLFLOW_TRACKING_URI / MLFLOW_EXPERIMENT_NAME / MLFLOW_RUN_CONTEXT =="
rg -n "MLFLOW_TRACKING_URI|MLFLOW_EXPERIMENT_NAME|MLFLOW_RUN_CONTEXT" . || true

echo "== rg for go/mlflowui =="
rg -n "go/mlflowui" . || true

echo "== rg for mlruns directory creation =="
rg -n "mlruns" . || true

echo "== rg for trainer.logger usage =="
rg -n "trainer\.logger" . || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 777


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="workspace/src/self_distill_hybrid/opsd_trainer.py"

python3 - <<'PY'
from pathlib import Path
path=Path("workspace/src/self_distill_hybrid/opsd_trainer.py")
lines=path.read_text().splitlines()
start=540; end=650  # around 581
for i in range(start-1, min(end, len(lines))):
    print(f"{i+1}\t{lines[i]}")
PY

echo "== rg for default_backend/self.config.trainer.logger usage =="
rg -n "default_backend=self\.config\.trainer\.logger|trainer\.logger|logger" workspace/src/self_distill_hybrid/opsd_trainer.py || true

echo "== rg for mlflow usage in self_distill_hybrid =="
rg -n "mlflow" workspace/src/self_distill_hybrid || true

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 7474


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== locate verl/utils/tracking.py in repo =="

find . -type f -path '*verl*utils*tracking.py' -maxdepth 8 2>/dev/null || true
find . -type f -name 'tracking.py' | head -n 50

echo "== rg for 'class Tracking' =="

rg -n "class Tracking" . || true

echo "== rg for 'default_backend' =="

rg -n "default_backend" . || true

echo "== rg for 'mlflow' inside Tracking implementation paths (best-effort) =="

rg -n "mlflow" . -g'*.py' | head -n 50

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 368


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import inspect, os, sys

try:
    import verl.utils.tracking as tracking
except Exception as e:
    print("Failed to import verl.utils.tracking:", e)
    sys.exit(0)

print("Tracking module file:", tracking.__file__)
src = inspect.getsource(tracking)
print("---- tracking.py snippet around Tracking class ----")
# Print only relevant fragments to keep output small
for needle in ["class Tracking", "mlflow", "MLFLOW_TRACKING_URI", "default_backend", "tracking_uri", "set_tracking_uri"]:
    idx = src.find(needle)
    print(needle, "found" if idx!=-1 else "NOT found", "at", idx)
print("---- end ----")

PY

python3 - <<'PY'
import re, inspect
import verl.utils.tracking as tracking
src = inspect.getsource(tracking)

# Extract likely mlflow-related functions/logic
patterns = [
    r"def .*mlflow.*?:.*?(?=\\n\\ndef|\\Z)",
    r"class Tracking.*?:.*?(?=\\n\\n)",
]
print("---- quick grep-like lines containing mlflow/uri ----")
for line in src.splitlines():
    if "mlflow" in line.lower() or "tracking_uri" in line.lower() or "MLFLOW_" in line:
        print(line)

PY

Repository: HJSang/CRISP_Reasoning_Compression

Length of output: 274


Gate the MLflow logger so local runs stay console-only.

workspace/scripts/sft/train_opsd.sh always sets trainer.logger='["console", "mlflow"]' (line 285). Even if setup_mlflow_hf.sh is a local no-op, VERL routes tracking to MLflow whenever "mlflow" is present; with MLFLOW_TRACKING_URI unset, MLflow falls back to its local backend (creating mlruns/), so local invocations won’t be console-only.

Suggested change
+TRAINER_LOGGER='["console"]'
+if [[ -n "${FLYTE_INTERNAL_EXECUTION_PROJECT:-}" ]]; then
+    TRAINER_LOGGER='["console", "mlflow"]'
+fi
+
 python3 -m self_distill_hybrid.main_opsd \
@@
-    trainer.logger='["console", "mlflow"]' \
+    trainer.logger="${TRAINER_LOGGER}" \
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 63-63: Not following: ./setup_mlflow_hf.sh was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workspace/scripts/sft/train_opsd.sh` around lines 58 - 63, The script always
enables MLflow in the HuggingFace Trainer by setting
trainer.logger='["console","mlflow"]' which causes local runs to create mlruns/;
change the logic in workspace/scripts/sft/train_opsd.sh around the
trainer.logger assignment so that "mlflow" is only added when a tracking backend
is configured (e.g. MLFLOW_TRACKING_URI is non-empty or
FLYTE_INTERNAL_EXECUTION_PROJECT is set); implement a simple conditional that
detects those env vars and sets trainer.logger to '["console","mlflow"]' when
true and '["console"]' otherwise, referencing the trainer.logger assignment and
the sourced setup_mlflow_hf.sh for placement.

Hejian Sang and others added 2 commits May 30, 2026 07:52
Commit 65f827f set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to
relieve the reverse-KL / JSD path fragmentation pressure inside
_opsd_training_step. But sglang's torch_memory_saver hard-asserts
against this allocator in _ensure_initialized:

    RuntimeError: TorchMemorySaver is disabled for the current process
    because expandable_segments is not supported yet.

triggered from sglang.srt.model_executor.model_runner.load_model ->
torch_memory_saver/entrypoint.py:199. This crashes the SGLang rollout
during model loading, which kills the Master replica and fails the
PyTorchJob with an opaque CalledProcessError.

Observed in tu50/tu100 retries (Flyte f486fd4845df14f5b930 /
f782dcc0ede2e4c1babf), which were the only resubmissions whose
mldev tarball was uploaded after 65f827f. tu1/tu10/tu20 retries
uploaded before 65f827f and proceeded normally.

Leave a comment in place so future-me doesn't re-add this. The
underlying reverse-KL OOM will be addressed in opsd_worker.py via
smaller chunk_size in _compute_*_loss_liger.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both tu10 (f2e8314b96b034161b5b) and tu20 (f6c97d63f75ee4383837) OOMed
in actor_rollout_ref_update_opsd with peaks ~72 GiB on H100 80 GB. The
hot allocations are the per-chunk fp32 intermediates inside the JSD /
reverse-KL paths in opsd_worker._compute_*_loss[_liger]:

  - log_softmax, exp, mixture, kl_t, kl_s — up to ~6 buffers of
    (chunk_size, V=151,936) fp32 each
  - at chunk_size=256 that's 6 * 256 * 152K * 4 = ~935 MiB per
    transient buffer; the autograd graph retains the student-side
    versions until backward, compounding pressure

Cut all chunk_size defaults to 1/4:
  - standard JSD / reverse-KL: 512 -> 128
  - liger JSD / reverse-KL:    256 -> 64
  - standard entropy:          512 -> 128
  - liger entropy:             256 -> 64

This shrinks each transient ~234 MiB (chunk_size=128, fp32) or ~117 MiB
(chunk_size=64), giving ~3-4 GiB peak-memory relief in the JSD path.
Loss math is unchanged — chunking is purely an algorithmic device.

Per the goal's OOM directive, also halve train_batch_size 8 -> 4 in
qwen3-14b-opsd-length-prune-{tu10,tu20}.json. With dp_world=2 the
padding path stays clean (4 % 2 == 0). Effective compute per step is
the same (micro_batch_size=1 means each step still backprops one
sample at a time); only the gradient-accumulation count changes from
4 to 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant