ConvNeXt (and possibly other modern models) implements its 1x1 convolutions as linear layers that are broadcast over height and width, because this is slightly more performant (according to a comment in the model repo).
E.g. from the ConvNeXt implementation in the FacebookResearch Repo:
Line 30: self.pwconv1 = nn.Linear(dim, 4 * dim) # pointwise/1x1 convs, implemented with linear layers
To achieve this, the tensor dimensions are temporarily permuted to channels-last:
Line 40: x = x.permute(0, 2, 3, 1) # (N, C, H, W) -> (N, H, W, C)
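For context, a condensed, self-contained sketch of that part of the ConvNeXt block (paraphrased from the FacebookResearch implementation; norm, layer scale, and residual details are omitted here):

import torch
from torch import nn

class PointwiseConvSection(nn.Module):
    # Condensed sketch: the 1x1 convs are nn.Linear layers applied to a
    # channels-last tensor, so each Linear broadcasts over H and W.
    def __init__(self, dim):
        super().__init__()
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 conv as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                 # x: (N, C, H, W)
        x = x.permute(0, 2, 3, 1)         # -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return x.permute(0, 3, 1, 2)      # -> back to (N, C, H, W)

Any activation captured between the two permutes (e.g. the output of pwconv1) is therefore channels-last.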
This leads to activations that match the Conv2D case in the _package_layer function (lines 281-282) but do not have the channel dimension first:
elif flatten_indices.shape[1] == 3:  # 2DConv, e.g. resnet
    flatten_coord_names = ['channel', 'channel_x', 'channel_y']
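The coordinate names are assigned purely by position, so for a channels-last activation the channel count ends up under 'channel_y'. A minimal sketch of the effect (this only mimics the positional naming, it is not the actual _package_layer code):

import numpy as np

def infer_coords(activation):
    # mimic the positional naming: the three non-batch dims are labeled
    # 'channel', 'channel_x', 'channel_y' in order, regardless of layout
    names = ['channel', 'channel_x', 'channel_y']
    return dict(zip(names, activation.shape[1:]))

nchw = np.zeros((1, 512, 14, 14))  # regular Conv2d output, channels first
nhwc = np.zeros((1, 14, 14, 512))  # ConvNeXt pwconv output, channels last
print(infer_coords(nchw))  # {'channel': 512, 'channel_x': 14, 'channel_y': 14}
print(infer_coords(nhwc))  # {'channel': 14, 'channel_x': 14, 'channel_y': 512}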
Thus, for models that commit such a 1x1-conv (linear) layer to any brain region, the wrong shape is inferred.
E.g. for the V1 region of ConvNeXt_xlarge (from the timm implementation, but the same principle applies), it infers:
channels: 14
x-dim: 14
y-dim: 4096
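The correct reading would presumably be channels: 4096, x-dim: 14, y-dim: 14. A hedged reproduction sketch for these numbers; the timm model name and layer path below are my assumptions (not necessarily the layer actually committed to V1), they are just a layer that produces the quoted shape at 224x224 input:

import timm
import torch

model = timm.create_model('convnext_xlarge', pretrained=False)
captured = {}

def hook(_module, _inputs, output):
    captured['act'] = output  # channels-last output of the first pointwise conv

# assumed layer path: stage 2 runs at 14x14 spatial resolution with dim 1024,
# so the mlp/pwconv1 hidden size is 4 * 1024 = 4096
model.stages[2].blocks[0].mlp.fc1.register_forward_hook(hook)
model(torch.zeros(1, 3, 224, 224))
print(captured['act'].shape)  # torch.Size([1, 14, 14, 4096]) -> gets labeled as
                              # channel=14, channel_x=14, channel_y=4096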
