
fix: prevent wandb hook accumulation across reruns #141

Open

suhr25 wants to merge 1 commit into ML4SCI:main from suhr25:fix/wandb-watch-hook-leak

Conversation


@suhr25 suhr25 commented Feb 20, 2026

Summary

The training pipelines across multiple modules (train.py, utils/train.py, and train_ray.py) call wandb.watch() inside the train() function.

This leads to repeated hook registration on the same model during:

  • Notebook reruns
  • Interrupted training (Ctrl+C)
  • Hyperparameter sweeps (Ray Tune)

Since these hooks are not always cleaned up reliably (especially in notebook or interrupted execution paths), they accumulate over time.

Problematic Pattern

wandb.watch(model, criterion, log="all", log_freq=log_freq)

Each call registers forward and backward hooks on model parameters, which:

  • Add overhead to every loss.backward()
  • Increase CPU usage due to histogram computation
  • Continuously grow W&B event logs on disk
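
The underlying mechanism is plain PyTorch hook bookkeeping and can be reproduced without wandb. The sketch below is our illustration, not code from this repo; attach_grad_hooks is a made-up stand-in for what a logging integration does. Two simulated "reruns" leave two hooks on every parameter:

import torch
import torch.nn as nn

def attach_grad_hooks(model):
    # Stand-in for a logging integration: one backward hook per parameter.
    # The handles are returned but, as in the buggy pattern, never removed.
    return [p.register_hook(lambda g: g) for p in model.parameters()]

model = nn.Linear(4, 2)

# Simulate two notebook reruns of train() that each attach hooks.
attach_grad_hooks(model)
attach_grad_hooks(model)

for name, p in model.named_parameters():
    print(name, len(p._backward_hooks))  # prints 2 for weight and bias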

Fix

Two targeted fixes were applied to make this safe and idempotent:

1. Explicit Hook Cleanup Before Re-registering

wandb.unwatch(model)
wandb.watch(model, criterion, log="gradients", log_freq=log_freq)
  • wandb.unwatch(model) ensures stale hooks are removed
  • Prevents duplicate hook accumulation across runs
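
For reuse across the three training scripts, the unwatch-then-watch pattern could be wrapped in a small helper. This is only a sketch; safe_watch is a hypothetical name, not part of the PR or the wandb API, and it assumes an active wandb run:

import wandb

def safe_watch(model, criterion, log_freq):
    # Idempotent wandb.watch: drop any stale hooks first, then re-register.
    try:
        wandb.unwatch(model)
    except Exception:
        # unwatch may have nothing to remove on the first call; ignore that.
        pass
    wandb.watch(model, criterion, log="gradients", log_freq=log_freq)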

2. Reduced Logging Scope

# Before
log="all"

# After
log="gradients"
  • Removes forward hooks (weight histograms)
  • Keeps only gradient logging (sufficient for debugging)
  • Reduces CPU + disk overhead significantly

Files Updated

  • DeepLense_Classification_Transformers_Archil_Srivastava/train.py
  • Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train.py
  • Transformers_Classification_DeepLense_Kartik_Sachdev/utils/train_ray.py

Verification

The fix was validated through repeated real-world workflows:

✅ Notebook Reruns

  • Re-ran training multiple times without restarting the kernel
  • No increase in training time observed

✅ Hook Stability Check

# Tensor-level backward hooks per parameter; the count should not grow across reruns.
for name, param in model.named_parameters():
    print(name, len(param._backward_hooks or {}))
  • Hook count remains stable (no accumulation)
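
Forward hooks added by log="all" live on the modules rather than on the parameters, so a fuller check counts both. A sketch, assuming the same model object as above:

def count_hooks(model):
    # Tensor-level backward hooks sit on the parameters (gradient logging);
    # forward hooks (parameter/weight histograms) sit on the modules.
    backward = sum(len(p._backward_hooks or {}) for p in model.parameters())
    forward = sum(len(m._forward_hooks) for m in model.modules())
    return backward, forward

print(count_hooks(model))  # both counts should stay flat across reruns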

✅ Performance Consistency

  • loss.backward() time remains constant across runs
  • CPU usage does not increase over time
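
The backward-pass timing can be checked directly with a self-contained sketch (dummy model and data, not from the repo):

import time
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
criterion = nn.MSELoss()
x, y = torch.randn(256, 128), torch.randn(256, 10)

def avg_backward_time(n=50):
    total = 0.0
    for _ in range(n):
        loss = criterion(model(x), y)
        start = time.perf_counter()
        loss.backward()
        total += time.perf_counter() - start
        model.zero_grad()
    return total / n

# Measure once per "rerun"; the average should not drift upward.
print(f"avg loss.backward(): {avg_backward_time():.6f}s")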

✅ W&B Logging Behavior

  • wandb/run-* directory growth is stable
  • No excessive event file generation
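
Local run-directory growth can be checked with a few lines of Python (assumes the default local wandb/ directory):

from pathlib import Path

def dir_mb(path):
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e6

# Growth should track the number of logged steps, not the number of reruns.
for run_dir in sorted(Path("wandb").glob("run-*")):
    print(run_dir.name, f"{dir_mb(run_dir):.1f} MB")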

Impact

  • Eliminates silent performance degradation during iterative experimentation
  • Prevents unbounded growth of W&B event logs on disk
  • Ensures stable training time across reruns and sweeps
  • Makes training pipelines safe for notebook-based workflows and Ray Tune trials

This change addresses a subtle but high-impact issue affecting reproducibility, performance, and resource usage in long-running ML workflows.

Replace wandb.watch(log="all") with unwatch+watch(log="gradients")
in all three training scripts to avoid duplicate backward/forward
hook registration on notebook reruns and sweep trials.

Signed-off-by: suhr25 <suhridmarwah07@gmail.com>

suhr25 commented Feb 20, 2026

Hi @ereinha, this resolves duplicate hook accumulation by calling wandb.unwatch(model) before wandb.watch().
Also switches to log="gradients" to reduce logging overhead and prevent excessive event-log growth.

