
fix: prevent wandb hook accumulation across reruns #141

Open

suhr25 wants to merge 1 commit into ML4SCI:main from suhr25:fix/wandb-watch-hook-leak

Conversation


@suhr25 suhr25 commented Feb 20, 2026

Summary

The training pipelines across multiple modules (train.py, utils/train.py, and train_ray.py) call wandb.watch() inside the train() function.

This leads to repeated hook registration on the same model during:

  • Notebook reruns
  • Interrupted training (Ctrl+C)
  • Hyperparameter sweeps (Ray Tune)

Since these hooks are not always cleaned up reliably (especially in notebook or interrupted execution paths), they accumulate over time.

Problematic Pattern

wandb.watch(model, criterion, log="all", log_freq=log_freq)

Each call registers forward and backward hooks on model parameters, which:

  • Add overhead to every loss.backward()
  • Increase CPU usage due to histogram computation
  • Continuously grow W&B event logs on disk
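
The underlying mechanism is plain PyTorch hook bookkeeping and can be reproduced without wandb. The sketch below is our illustration, not code from this repo; attach_grad_hooks is a made-up stand-in for what a logging integration does. Two simulated "reruns" leave two hooks on every parameter:

import torch
import torch.nn as nn

def attach_grad_hooks(model):
    # Stand-in for a logging integration: one backward hook per parameter.
    # The handles are returned but, as in the buggy pattern, never removed.
    return [p.register_hook(lambda g: g) for p in model.parameters()]

model = nn.Linear(4, 2)

# Simulate two notebook reruns of train() that each attach hooks.
attach_grad_hooks(model)
attach_grad_hooks(model)

for name, p in model.named_parameters():
    print(name, len(p._backward_hooks))  # prints 2 for weight and bias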

Fix

Two targeted fixes were applied to make this safe and idempotent:

1. Explicit Hook Cleanup Before Re-registering

wandb.unwatch(model)
wandb.watch(model, criterion, log="gradients", log_freq=log_freq)
  • wandb.unwatch(model) ensures stale hooks are removed
  • Prevents duplicate hook accumulation across runs
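
For reuse across the three training scripts, the unwatch-then-watch pattern could be wrapped in a small helper. This is only a sketch; safe_watch is a hypothetical name, not part of the PR or the wandb API, and it assumes an active wandb run:

import wandb

def safe_watch(model, criterion, log_freq):
    # Idempotent wandb.watch: drop any stale hooks first, then re-register.
    try:
        wandb.unwatch(model)
    except Exception:
        # unwatch may have nothing to remove on the first call; ignore that.
        pass
    wandb.watch(model, criterion, log="gradients", log_freq=log_freq)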

2. Reduced Logging Scope

# Before
log="all"

# After
log="gradients"
  • Removes forward hooks (weight histograms)
  • Keeps only gradient logging (sufficient for debugging)
  • Reduces CPU + disk overhead significantly

Files Updated

  • DeepLense_Classification_Transformers_Archil_Srivastava/train.py
  • Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train.py
  • Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train_ray.py

Verification

The fix was validated through repeated real-world workflows:

✅ Notebook Reruns

  • Re-ran training multiple times without restarting the kernel
  • No increase in training time observed

✅ Hook Stability Check

# Tensor-level backward hooks per parameter; the count should not grow across reruns.
for name, param in model.named_parameters():
    print(name, len(param._backward_hooks or {}))
  • Hook count remains stable (no accumulation)
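
Forward hooks added by log="all" live on the modules rather than on the parameters, so a fuller check counts both. A sketch, assuming the same model object as above:

def count_hooks(model):
    # Tensor-level backward hooks sit on the parameters (gradient logging);
    # forward hooks (parameter/weight histograms) sit on the modules.
    backward = sum(len(p._backward_hooks or {}) for p in model.parameters())
    forward = sum(len(m._forward_hooks) for m in model.modules())
    return backward, forward

print(count_hooks(model))  # both counts should stay flat across reruns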

✅ Performance Consistency

  • loss.backward() time remains constant across runs
  • CPU usage does not increase over time
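
The backward-pass timing can be checked directly with a self-contained sketch (dummy model and data, not from the repo):

import time
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
criterion = nn.MSELoss()
x, y = torch.randn(256, 128), torch.randn(256, 10)

def avg_backward_time(n=50):
    total = 0.0
    for _ in range(n):
        loss = criterion(model(x), y)
        start = time.perf_counter()
        loss.backward()
        total += time.perf_counter() - start
        model.zero_grad()
    return total / n

# Measure once per "rerun"; the average should not drift upward.
print(f"avg loss.backward(): {avg_backward_time():.6f}s")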

✅ W&B Logging Behavior

  • wandb/run-* directory growth is stable
  • No excessive event file generation
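
Local run-directory growth can be checked with a few lines of Python (assumes the default local wandb/ directory):

from pathlib import Path

def dir_mb(path):
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e6

# Growth should track the number of logged steps, not the number of reruns.
for run_dir in sorted(Path("wandb").glob("run-*")):
    print(run_dir.name, f"{dir_mb(run_dir):.1f} MB")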

Impact

  • Eliminates silent performance degradation during iterative experimentation
  • Prevents unbounded growth of W&B event logs on disk
  • Ensures stable training time across reruns and sweeps
  • Makes training pipelines safe for notebook-based workflows and Ray Tune trials

This change addresses a subtle but high-impact issue affecting reproducibility, performance, and resource usage in long-running ML workflows.

Replace wandb.watch(log="all") with unwatch+watch(log="gradients")
in all three training scripts to avoid duplicate backward/forward
hook registration on notebook reruns and sweep trials.

Signed-off-by: suhr25 <suhridmarwah07@gmail.com>

suhr25 commented Feb 20, 2026

Hi @ereinha, this resolves duplicate hook accumulation by calling wandb.unwatch(model) before wandb.watch().
Also switches to log="gradients" to reduce logging overhead and prevent excessive event-log growth.

