fix: prevent wandb hook accumulation across reruns #141
Open
suhr25 wants to merge 1 commit into ML4SCI:main from
Conversation
Replace `wandb.watch(log="all")` with `unwatch` + `watch(log="gradients")` in all three training scripts to avoid duplicate backward/forward hook registration on notebook reruns and sweep trials.

Signed-off-by: suhr25 <suhridmarwah07@gmail.com>
Author
Hi @ereinha, this resolves duplicate hook accumulation by calling `wandb.unwatch(model)` before `wandb.watch()`.
Summary
The training pipelines across multiple modules (`train.py`, `utils/train.py`, and `train_ray.py`) were calling `wandb.watch()` inside the `train()` function. This leads to repeated hook registration on the same model during:

- notebook reruns
- hyperparameter sweep trials
Since these hooks are not always cleaned up reliably (especially in notebook or interrupted execution paths), they accumulate over time.
Problematic Pattern
Each call registers forward and backward hooks on model parameters, which:

- adds another set of hooks that all fire on every `loss.backward()`, slowing it down over time
- inflates W&B logging volume and `wandb/run-*` directory growth
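A minimal sketch of the old pattern (simplified; the actual `train()` signatures in the three scripts differ):

```python
import torch.nn.functional as F
import wandb

def train(model, loader, optimizer):
    # Re-invoked on every notebook rerun / sweep trial with the same model
    # object; each call registers another set of hooks and none are removed.
    wandb.watch(model, log="all")

    for inputs, targets in loader:
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()  # every accumulated hook fires here
        optimizer.step()
        optimizer.zero_grad()
```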
Fix
Two targeted fixes were applied to make this safe and idempotent:
1. Explicit Hook Cleanup Before Re-registering
- `wandb.unwatch(model)` ensures stale hooks are removed before `wandb.watch()` is called again

2. Reduced Logging Scope

- `log="all"` is replaced with `log="gradients"`, narrowing what gets hooked and logged on each call
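Combined, the updated call site looks roughly like this (a sketch; the surrounding `train()` code is simplified, only the two `wandb` calls reflect the actual change):

```python
import torch.nn.functional as F
import wandb

def train(model, loader, optimizer):
    # Remove hooks registered by any earlier wandb.watch() call on this
    # model, so reruns and sweep trials start from a clean slate.
    wandb.unwatch(model)
    # Re-register with a narrower logging scope than log="all".
    wandb.watch(model, log="gradients")

    for inputs, targets in loader:
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```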
Files Updated
- `DeepLense_Classification_Transformers_Archil_Srivastava/train.py`
- `Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train.py`
- `Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train_ray.py`

Verification
The fix was validated through repeated real-world workflows:
✅ Notebook Reruns
✅ Hook Stability Check (see the sketch after this list)
✅ Performance Consistency
- `loss.backward()` time remains constant across runs

✅ W&B Logging Behavior

- `wandb/run-*` directory growth is stable
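A rough way to run the hook stability check above is to count the hooks present on the model between reruns. This peeks at PyTorch's private `_backward_hooks` / `_forward_hooks` attributes and assumes `model` is the network being trained, so treat it as an illustrative diagnostic only, not part of the change:

```python
def count_hooks(model):
    # Gradient hooks registered on parameters (e.g. by wandb.watch), plus
    # forward hooks registered on modules (used when parameters are logged).
    grad_hooks = sum(len(getattr(p, "_backward_hooks", None) or {})
                     for p in model.parameters())
    fwd_hooks = sum(len(m._forward_hooks) for m in model.modules())
    return grad_hooks, fwd_hooks

# With the unwatch() + watch() pattern, these counts should stay flat
# across repeated train() calls instead of growing each time.
print(count_hooks(model))
```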
Impact
This change addresses a subtle but high-impact issue affecting reproducibility, performance, and resource usage in long-running ML workflows.