I think there may be a logic issue in fisher.py (around line 113) in the Fisher/influence computation.
The code indexes the activations like:
original_ffn_activations[layer][0]
For Qwen2.5-VL, there is no CLS token, and the vision features are represented as a sequence of visual patch tokens with shape (batch * n_patches, hidden_size). In this case, the shape would be (1*256, 3420). Indexing [0] would therefore compute Fisher information and influential paths using only the first visual token (the top-left patch), rather than an aggregate over the full image.
Questions:
- Was using only index 0 intentional?
- If not, should the influence be computed over all patches, either by mean-pooling across all tokens or by attention-weighted pooling?
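
To make the difference concrete, here is a minimal sketch of the two pooling options next to the current [0] indexing. It uses NumPy with random data in place of the real activations, and `pool_patch_activations`, `acts`, and `attn` are hypothetical names for illustration, not identifiers from fisher.py. I am also assuming the per-layer activations are a 2-D array of shape (n_patches, hidden_size), as described above.

```python
import numpy as np


def pool_patch_activations(acts, weights=None):
    """Collapse (n_patches, hidden_size) activations into one (hidden_size,) vector.

    Hypothetical helper: with weights=None this is plain mean-pooling;
    otherwise the non-negative per-patch weights are normalised to sum to 1
    and used for a weighted average (e.g. attention-weighted pooling).
    """
    acts = np.asarray(acts, dtype=np.float64)
    if weights is None:
        return acts.mean(axis=0)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalise so the weights form a distribution over patches
    return w @ acts  # (n_patches,) @ (n_patches, hidden_size) -> (hidden_size,)


# Stand-in data with the shape from the report: 1 * 256 patches, 3420 features.
rng = np.random.default_rng(0)
n_patches, hidden_size = 256, 3420
acts = rng.normal(size=(n_patches, hidden_size))

first_only = acts[0]                        # what indexing [0] selects: one patch
mean_pooled = pool_patch_activations(acts)  # aggregate over every patch

attn = rng.random(n_patches)                # placeholder attention scores (assumed)
attn_pooled = pool_patch_activations(acts, attn)
```

All three results have shape (hidden_size,), so either pooled vector could in principle be dropped in where the [0] slice is used today, if the downstream Fisher computation only expects a single feature vector per layer.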