P-SIFT: precondition-and-sketch via modified projections (Louis's review of #275) by jammastergirish · Pull Request #288 · EleutherAI/bergson

jammastergirish · 2026-05-25T22:03:56Z

Stacked on top of #275 — addresses @LouisYRYJ' comments from last week's discussion.

What's different from #275

Our discussion and the inline comments asked for the precondition-and-sketch approach instead: compose the random projection with H^{-1/2} into a single saved matrix M = R · cov^{-1/2}, then use M wherever the gradient collector used to sample R. Query and training gradients both go through M, so <ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q> without ever applying H⁻¹ in low-dim space.

This is as per

and

Comment-to-code map

Pulled from comments in #275

Louis's comment	Where it lives
Top-level: "save the eigenvalues of A and S"	`eigenvectors.py` now writes `eigval_{activation,gradient}_sharded/` next to the existing eigvec dirs
Top-level: "compute (damped) inverses then random-project the curvature itself"	`apply_hessian.py:build_kfac_projections` forms `M = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀ` per (layer, side)
Top-level: "works for any method, not just kfac"	Method-agnostic — `build_kfac_projections` reads `eigen_/eigval_` which kfac/tkfac/shampoo all produce via the shared `hessian_worker`
Top-level: "EV correction will not work"	`pipeline.py` raises immediately on `projection_dim > 0 + ev_correction=True`; secondary guard in `compute_ivhp_sharded`
`apply_hessian.py:38` — "TODO: H⁻¹/² multiplications"	`D = (E + damp).clamp_min(...).rsqrt()` → `M = R @ (Q * D) @ Q.T`
`pipeline.py:43` — "throw error compression + ev_correction"	Top-of-pipeline guard
`pipeline.py:43` — "also works for shampoo"	Method-agnostic; same answer as above
`pipeline.py:115` — "Step 3.5: Compute R @ H⁻¹/²"	Labeled `Step 3.5/4` print + `_timed` block + dedicated `build_projections_worker`
`pipeline.py:115` — "right and left projection separately"	`SIDE_TO_COV = {"left": "gradient", "right": "activation"}`; build loop produces distinct M_left (with S^{-1/2}) and M_right (with A^{-1/2})
`pipeline.py:115` — "save to disk"	`save_file(..., projection_{side}_sharded/shard_{rank}.safetensors)`
`pipeline.py:126` — "find where random projections multiply with grads, add a config to use the saved ones"	New `IndexConfig.kfac_projection_path`; `GradientCollector.setup()` calls `load_kfac_projections` which pre-populates `processor._projection_matrices` so `HookCollectorBase.projection()` returns saved M, not a fresh R
Lucia: "too many initialisms (`pt`)"	Renamed throughout; only `"pt"` remaining is the safetensors framework arg
Bias support	Not addressed — Louis flagged "not load-bearing" in the earlier thread

Verification

Smoke config: gpt2, wikitext train[:128] as the K-FAC fitting set, test[:4] as the query, ev_correction=False.

Single GPU

13/13 K-FAC tests pass (pytest tests/ekfac_tests/).
End-to-end pipeline runs clean; saves eigenvectors, eigenvalues, M_left, M_right; applies M·G·Mᵀ; produces 256 well-formed scores (no NaN, heavy-tailed both signs).
Spearman ρ against legacy uncompressed (projection_dim=0) climbs monotonically:

p	ρ vs legacy
16	0.69
32	0.83
64	0.95
128	0.99

Clean convergence to ground truth → math is right; low-p noise is exactly what you'd expect from a 16-dim sketch of 768-dim per-layer matrices.

Multi-GPU (2× A40)
Sharded pipeline completes cleanly — exercises build_kfac_projections, fair_distribute_by_cost, per-rank shard writes, multi-shard load in _apply_compressed and load_kfac_projections.

	1-GPU vs legacy	2-GPU vs legacy	1-GPU vs 2-GPU
Legacy (p=0)	—	—	0.997
p=16	0.69	0.48	0.73
p=32	0.83	0.89	0.91
p=64	0.95	0.96	0.96
p=128	0.99	0.97	0.98

Legacy 1-GPU vs 2-GPU = 0.997 confirms the distributed aggregation pipeline is sound independent of P-SIFT.
Both P-SIFT runs converge to ground truth as p grows; they also converge to each other.
Bit-identical scores across GPU counts are not guaranteed: FP non-associativity in dist.all_reduce(SUM) during covariance accumulation produces a slightly different M per layout, and that small difference gets amplified by a low-p sketch. Rank correlation is what matters for retrieval and it's solid by p ≥ 32.

Note for reviewers

There's one harmless UserWarning about a non-writable NumPy array in apply_hessian.py:259 (the mmap is read-only and we never write to that tensor). Can fix with a .copy() if desired; left as-is to keep the diff focused on Louis's comments.

🤖 Generated with Claude Code

luciaquirke · 2026-05-29T04:10:06Z

Can fix with a .copy() if desired

Nah not an issue

luciaquirke · 2026-05-29T04:11:18Z

+    "torch>=2.5,<2.10",
+    # torchvision/torchaudio aren't used by bergson directly, but transformers'
+    # image_utils does an unconditional `from torchvision.io import ...` at
+    # import time when torchvision is present. RunPod (and most CUDA base


I don't think we should reference RunPod in particular in our library

luciaquirke · 2026-05-29T04:11:51Z

 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: 'v0.15.13'
+    rev: 'v0.15.14'
    hooks:


Do we need to change this here?

luciaquirke · 2026-05-29T04:12:47Z

-                    g,
-                    "sum",
-                    dist.distributed_c10d._get_default_group(),
+                k: wait_tensor(


Are these changes intentional? We might need to rebase

LouisYRYJ · 2026-06-02T22:43:12Z

Thanks @jammastergirish!
Looks all correct at first glance, maybe you can rebase before I took a close look?

jammastergirish · 2026-06-02T22:48:26Z

Thanks @jammastergirish! Looks all correct at first glance, maybe you can rebase before I took a close look?

Oh sorry. I thought I had. Will do shortly.

Per Louis Jaburi's review on #275: compose the random projection with H^{-1/2} into a single saved matrix M = R · cov^{-1/2}, then use M wherever the gradient collector used to sample R. Query and training gradients both go through M, so <ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q> without ever applying H⁻¹ in low-dim space. - eigenvectors.py: persist per-side eigenvalues to eigval_{activation, gradient}_sharded/ alongside the existing eigenvectors. - apply_hessian.py: build_kfac_projections builds M = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀ per (layer, side) and writes projection_{left,right}_sharded/. _apply_compressed applies ĝ_q = M_l · G_q · M_rᵀ. Legacy rotate-divide-rotate path is unchanged. Early guard rejects ev_correction + compression (the joint S⊗A eigenvalue correction breaks the Kronecker structure). - config.py + gradient_collectors.py: new IndexConfig.kfac_projection_path pre-populates processor._projection_matrices from disk in GradientCollector.setup(), so HookCollectorBase.projection() returns the saved M instead of sampling a fresh R at hook time. - pipeline.py: visible Step 3.5 builds M and projects the query; Step 4 sets kfac_projection_path so the training-side collector uses the same M. Top-level guard fails fast on projection_dim > 0 + ev_correction=True before any work runs. Untested on GPU (Mac, no CUDA); smoke yaml at runs/p_sift_smoke.yaml. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

create_projection_matrix returns unit-norm rows, so E[RᵀR] = (p/d)·I rather than the identity an unbiased JL sketch requires. For the single global projection this p/d factor is a harmless global constant, but P-SIFT applies the projection per-side and per-layer with differing d, so the score picked up a per-layer p²/(d_left·d_right) reweighting that varies ~4x across gpt2 layer shapes and corrupted ranking vs the exact projection_dim=0 path. Rescale R by √(d/p) in build_kfac_projections to restore E[RᵀR] = I. Verified by projection_dim sweep against the exact (legacy) scores: Spearman ρ vs legacy went from a declining 0.60/0.23/0.30/0.13 (bug) to a climbing 0.64/0.85/0.985/0.994 for p = 16/32/64/128, confirming the sketch now converges to H⁻¹ as p grows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The deferred import inside GradientCollector.setup() wasn't avoiding a circular import (nothing imports gradient_collectors, and apply_hessian's import closure never reaches it), so hoist it alongside the other bergson imports per the project convention of imports-at-top. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LouisYRYJ · 2026-06-03T20:52:34Z

+    rank = dist.get_rank() if dist.is_initialized() else 0
+    world_size = dist.get_world_size() if dist.is_initialized() else 1


make sure you use the correct rank and world size (important for multi node setups)

(it seems like they should be the global ones?)

LouisYRYJ · 2026-06-03T20:55:36Z

+
+    names = list(side_dims["left"].keys())
+
+    per_layer_dim = {n: max(side_dims["left"][n], side_dims["right"][n]) for n in names}


name is somewhat confusing and I am not sure I understand what this is doing. If you want to fairly distribute by cost, then looking at the max is not necessarily correct?

LouisYRYJ · 2026-06-03T20:58:11Z

+            # create_projection_matrix returns unit-norm rows, so E[RᵀR] =
+            # (p/d)·I. An unbiased sketch needs E[RᵀR] = I, so rescale by
+            # √(d/p); otherwise each side carries a p/d factor and the score
+            # picks up a per-layer p²/(d_left·d_right) reweighting that
+            # corrupts ranking relative to the exact (projection_dim=0) path.
+            R = R * (d / projection_dim) ** 0.5


not sure how relevant that is, we look at relative rankings only anyway

I suppose we can leave it there

LouisYRYJ · 2026-06-03T21:00:29Z

+        grad_buffer.flush()
+        self.logger.info(f"Saved sketched IVHP gradients to {self.cfg.run_path}")
+
+    def _apply_legacy(self):


why is this legacy?

LouisYRYJ · 2026-06-03T21:01:54Z

+                    index_cfg.distributed,
+                )
+    else:
+        # ── Step 3 (legacy): Apply inverse Hessian via rotate-divide-rotate ──


I think all these are not legacy, maybe I am missing something here

LouisYRYJ · 2026-06-03T21:02:33Z

    processor_path: str = ""
    """Path to a precomputed processor."""

+    kfac_projection_path: str = ""


not sure we should even make this configurable? Can't imagine a situation where we would need to change it

LouisYRYJ · 2026-06-03T21:03:35Z

Left a few comments, I think we are almost there!

We should also for now raise Error when doing lora + random projections because in that case one side should not be compressed, but we don't handle that right now. Would be a good follow up PR

jammastergirish marked this pull request as draft May 25, 2026 22:04

jammastergirish requested a review from LouisYRYJ May 26, 2026 17:51

jammastergirish marked this pull request as ready for review May 26, 2026 17:51

luciaquirke reviewed May 29, 2026

View reviewed changes

Comment thread bergson/collector/gradient_collectors.py Outdated

luciaquirke reviewed May 29, 2026

View reviewed changes

jammastergirish and others added 8 commits June 2, 2026 16:27

Deletes unnecessary comments

0fcd1d4

Removes unnecessary comment

ebd6169

Refactor

0177dc6

Repeated line

dd86adb

Adds ValueError

208616f

jammastergirish force-pushed the feat/p-sift-modified-projections branch from c99811e to 208616f Compare June 2, 2026 23:47

LouisYRYJ reviewed Jun 3, 2026

View reviewed changes

		rank = dist.get_rank() if dist.is_initialized() else 0
		world_size = dist.get_world_size() if dist.is_initialized() else 1


		names = list(side_dims["left"].keys())

		per_layer_dim = {n: max(side_dims["left"][n], side_dims["right"][n]) for n in names}

Conversation

jammastergirish commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's different from #275

Comment-to-code map

Verification

Note for reviewers

Uh oh!

luciaquirke commented May 29, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LouisYRYJ commented Jun 2, 2026

Uh oh!

jammastergirish commented Jun 2, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LouisYRYJ commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jammastergirish commented May 25, 2026 •

edited

Loading