K-FAC Hessian + include_bias=True produces incompatible gradient/covariance shapes

Found/generated by Claude during PR 275.

## Summary

When `bergson build` is run with `include_bias=True` and the Hessian step uses `method='kfac'`, `apply_hessian` (and the scoring paths downstream of it) fails at a per-layer reshape: the gradient store has one extra column per layer that the K-FAC activation covariance doesn't account for.

This is a **pre-existing** limitation independent of the K-FAC compression work in #275 — the same failure happens on `main`. Filing separately so the fix doesn't get tangled with the compression PR.

## What goes wrong (concrete shapes)

For each linear layer `nn.Linear(I, O, bias=True)`:

- **`bergson build`** with `include_bias=True` stores per-sample gradients of shape `[O, I+1]` per layer — the bias gradient is concatenated as an extra "activation" column. (`HookCollectorBase.shapes()` at `bergson/collector/collector.py:264-270` sets `grad_shape[-1] += 1` when `collect_bias`; `_compute_gradient` does the matching `torch.cat`.)
- **K-FAC `CovarianceCollector`** at `bergson/hessians/kfac.py` computes `A = aᵀa` from the raw forward input `a: [N·S, I]` — **no bias column**. The `collect_bias` flag is unpacked in `_init_covariance_dict` (`bergson/hessians/sharded_computation.py:28`) but never used. Result: `A: [I, I]`.
- At apply time, `compute_ivhp_sharded` reshapes the loaded query gradient via:
  ```python
  gradients_noi.view(-1, eigen_g[k].shape[1], eigen_a[k].shape[1])
  # = view(-1, O, I)
  # but stored flat size is N·O·(I+1) → reshape error
  ```
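The mismatch above can be reproduced in isolation with plain tensors. This is a minimal sketch, not code from bergson: the sizes `N`, `O`, `I` are made up, the identity matrices stand in for the real eigenvector factors, and it assumes the bias gradient is appended as the last column.

```python
import torch

N, O, I = 3, 4, 5  # hypothetical sizes: samples, out features, in features

# Build-time layout with include_bias=True: the bias gradient is appended
# as one extra column (assumed here to be the last one).
weight_grad = torch.randn(N, O, I)
bias_grad = torch.randn(N, O, 1)
stored = torch.cat([weight_grad, bias_grad], dim=-1).reshape(N, -1)  # [N, O*(I+1)]

# K-FAC factors computed without the bias column (stand-ins, not real factors).
eigen_g = torch.eye(O)  # [O, O]
eigen_a = torch.eye(I)  # [I, I]

try:
    # Same pattern as the reshape in compute_ivhp_sharded.
    stored.view(-1, eigen_g.shape[1], eigen_a.shape[1])
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print(f"view failed: {err}")
```

Note that `view` only raises when the flat size `N·O·(I+1)` is not a multiple of `O·I`. If it happens to be a multiple (for example when `I` divides `N`), the call would return a misaligned tensor instead of failing, so the error is not guaranteed to surface in every configuration.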

## Repro

```sh
bergson build --processor.include_bias true ...
bergson approximate-hessians --hessian_cfg.method kfac ...
bergson apply-hessian ...    # raises at .view(-1, O, I)
```

## Fix sketch

Teach K-FAC's covariance collection to operate on the **augmented activation** `[a; 1]` of shape `[N·S, I+1]` when the layer's bias is being collected:

- `CovarianceCollector.forward_hook` appends a 1-column to `a` (matching the build-time gradient layout) before `aᵀa`, giving `A: [I+1, I+1]`.
- `_init_covariance_dict` sizes the activation covariance as `[I+1, I+1]` when `collect_bias=True`.
- Downstream (`compute_eigendecomposition`, `compute_whitening_projection_matrices`, `apply_hessian`) all derive `d_A` from `A.shape[-1]`, so they pick up the new dimension automatically.

The gradient covariance `S: [O, O]` is unchanged. (This `S` is the covariance factor, not the sequence length `S` that appears in `[N·S, I]` above.)
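The augmentation in the fix sketch could look roughly like the following. `augmented_activation_covariance` is a hypothetical helper written for illustration, not an existing bergson function, and it assumes the ones column goes last to match the build-time gradient layout.

```python
import torch


def augmented_activation_covariance(a: torch.Tensor, collect_bias: bool) -> torch.Tensor:
    """Hypothetical helper: return a^T a, appending a ones column when collect_bias."""
    if collect_bias:
        ones = a.new_ones((a.shape[0], 1))
        a = torch.cat([a, ones], dim=-1)  # [N*S, I+1]
    return a.T @ a


N, S, I, O = 2, 3, 5, 4  # hypothetical sizes
a = torch.randn(N * S, I)

A = augmented_activation_covariance(a, collect_bias=True)         # [I+1, I+1]
A_plain = augmented_activation_covariance(a, collect_bias=False)  # [I, I]

# With the augmented factor, the apply-time reshape matches the stored layout.
stored = torch.randn(N, O * (I + 1))
per_layer = stored.view(-1, O, A.shape[-1])  # [N, O, I+1]
```

The top-left `[I, I]` block of `A` equals the un-augmented covariance, the last row and column hold the column sums of `a`, and the bottom-right entry is the row count `N·S`. Existing weight-only statistics are therefore a sub-block of the augmented factor.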

## Scope

- Out of scope for #275 — that PR is about compression and inherits whatever fix lands here for free.
- In scope for whoever picks this up: a fix here also unblocks K-FAC + bias on `main` (legacy IVHP path).
