P-SIFT: precondition-and-sketch via modified projections (Louis's review of #275)#288
Conversation
Nah not an issue |
| "torch>=2.5,<2.10", | ||
| # torchvision/torchaudio aren't used by bergson directly, but transformers' | ||
| # image_utils does an unconditional `from torchvision.io import ...` at | ||
| # import time when torchvision is present. RunPod (and most CUDA base |
There was a problem hiding this comment.
I don't think we should reference RunPod in particular in our library
| - repo: https://github.com/astral-sh/ruff-pre-commit | ||
| rev: 'v0.15.13' | ||
| rev: 'v0.15.14' | ||
| hooks: |
There was a problem hiding this comment.
Do we need to change this here?
| g, | ||
| "sum", | ||
| dist.distributed_c10d._get_default_group(), | ||
| k: wait_tensor( |
There was a problem hiding this comment.
Are these changes intentional? We might need to rebase
|
Thanks @jammastergirish! |
Oh sorry. I thought I had. Will do shortly. |
Per Louis Jaburi's review on #275: compose the random projection with H^{-1/2} into a single saved matrix M = R · cov^{-1/2}, then use M wherever the gradient collector used to sample R. Query and training gradients both go through M, so <ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q> without ever applying H⁻¹ in low-dim space. - eigenvectors.py: persist per-side eigenvalues to eigval_{activation, gradient}_sharded/ alongside the existing eigenvectors. - apply_hessian.py: build_kfac_projections builds M = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀ per (layer, side) and writes projection_{left,right}_sharded/. _apply_compressed applies ĝ_q = M_l · G_q · M_rᵀ. Legacy rotate-divide-rotate path is unchanged. Early guard rejects ev_correction + compression (the joint S⊗A eigenvalue correction breaks the Kronecker structure). - config.py + gradient_collectors.py: new IndexConfig.kfac_projection_path pre-populates processor._projection_matrices from disk in GradientCollector.setup(), so HookCollectorBase.projection() returns the saved M instead of sampling a fresh R at hook time. - pipeline.py: visible Step 3.5 builds M and projects the query; Step 4 sets kfac_projection_path so the training-side collector uses the same M. Top-level guard fails fast on projection_dim > 0 + ev_correction=True before any work runs. Untested on GPU (Mac, no CUDA); smoke yaml at runs/p_sift_smoke.yaml. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_projection_matrix returns unit-norm rows, so E[RᵀR] = (p/d)·I rather than the identity an unbiased JL sketch requires. For the single global projection this p/d factor is a harmless global constant, but P-SIFT applies the projection per-side and per-layer with differing d, so the score picked up a per-layer p²/(d_left·d_right) reweighting that varies ~4x across gpt2 layer shapes and corrupted ranking vs the exact projection_dim=0 path. Rescale R by √(d/p) in build_kfac_projections to restore E[RᵀR] = I. Verified by projection_dim sweep against the exact (legacy) scores: Spearman ρ vs legacy went from a declining 0.60/0.23/0.30/0.13 (bug) to a climbing 0.64/0.85/0.985/0.994 for p = 16/32/64/128, confirming the sketch now converges to H⁻¹ as p grows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deferred import inside GradientCollector.setup() wasn't avoiding a circular import (nothing imports gradient_collectors, and apply_hessian's import closure never reaches it), so hoist it alongside the other bergson imports per the project convention of imports-at-top. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c99811e to
208616f
Compare
| rank = dist.get_rank() if dist.is_initialized() else 0 | ||
| world_size = dist.get_world_size() if dist.is_initialized() else 1 |
There was a problem hiding this comment.
make sure you use the correct rank and world size (important for multi node setups)
There was a problem hiding this comment.
(it seems like they should be the global ones?)
|
|
||
| names = list(side_dims["left"].keys()) | ||
|
|
||
| per_layer_dim = {n: max(side_dims["left"][n], side_dims["right"][n]) for n in names} |
There was a problem hiding this comment.
name is somewhat confusing and I am not sure I understand what this is doing. If you want to fairly distribute by cost, then looking at the max is not necessarily correct?
| # create_projection_matrix returns unit-norm rows, so E[RᵀR] = | ||
| # (p/d)·I. An unbiased sketch needs E[RᵀR] = I, so rescale by | ||
| # √(d/p); otherwise each side carries a p/d factor and the score | ||
| # picks up a per-layer p²/(d_left·d_right) reweighting that | ||
| # corrupts ranking relative to the exact (projection_dim=0) path. | ||
| R = R * (d / projection_dim) ** 0.5 |
There was a problem hiding this comment.
not sure how relevant that is, we look at relative rankings only anyway
There was a problem hiding this comment.
I suppose we can leave it there
| grad_buffer.flush() | ||
| self.logger.info(f"Saved sketched IVHP gradients to {self.cfg.run_path}") | ||
|
|
||
| def _apply_legacy(self): |
| index_cfg.distributed, | ||
| ) | ||
| else: | ||
| # ── Step 3 (legacy): Apply inverse Hessian via rotate-divide-rotate ── |
There was a problem hiding this comment.
I think all these are not legacy, maybe I am missing something here
| processor_path: str = "" | ||
| """Path to a precomputed processor.""" | ||
|
|
||
| kfac_projection_path: str = "" |
There was a problem hiding this comment.
not sure we should even make this configurable? Can't imagine a situation where we would need to change it
|
Left a few comments, I think we are almost there! We should also for now raise Error when doing lora + random projections because in that case one side should not be compressed, but we don't handle that right now. Would be a good follow up PR |
Stacked on top of #275 — addresses @LouisYRYJ' comments from last week's discussion.
What's different from #275
Our discussion and the inline comments asked for the precondition-and-sketch approach instead: compose the random projection with
H^{-1/2}into a single saved matrixM = R · cov^{-1/2}, then useMwherever the gradient collector used to sampleR. Query and training gradients both go throughM, so<ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q>without ever applyingH⁻¹in low-dim space.This is as per
and
Comment-to-code map
Pulled from comments in #275
eigenvectors.pynow writeseigval_{activation,gradient}_sharded/next to the existing eigvec dirsapply_hessian.py:build_kfac_projectionsformsM = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀper (layer, side)build_kfac_projectionsreadseigen_*/eigval_*which kfac/tkfac/shampoo all produce via the sharedhessian_workerpipeline.pyraises immediately onprojection_dim > 0 + ev_correction=True; secondary guard incompute_ivhp_shardedapply_hessian.py:38— "TODO: H⁻¹/² multiplications"D = (E + damp).clamp_min(...).rsqrt()→M = R @ (Q * D) @ Q.Tpipeline.py:43— "throw error compression + ev_correction"pipeline.py:43— "also works for shampoo"pipeline.py:115— "Step 3.5: Compute R @ H⁻¹/²"Step 3.5/4print +_timedblock + dedicatedbuild_projections_workerpipeline.py:115— "right and left projection separately"SIDE_TO_COV = {"left": "gradient", "right": "activation"}; build loop produces distinct M_left (with S^{-1/2}) and M_right (with A^{-1/2})pipeline.py:115— "save to disk"save_file(..., projection_{side}_sharded/shard_{rank}.safetensors)pipeline.py:126— "find where random projections multiply with grads, add a config to use the saved ones"IndexConfig.kfac_projection_path;GradientCollector.setup()callsload_kfac_projectionswhich pre-populatesprocessor._projection_matricessoHookCollectorBase.projection()returns saved M, not a fresh Rpt)""pt"remaining is the safetensors framework argVerification
Smoke config: gpt2, wikitext
train[:128]as the K-FAC fitting set,test[:4]as the query,ev_correction=False.Single GPU
pytest tests/ekfac_tests/).M·G·Mᵀ; produces 256 well-formed scores (no NaN, heavy-tailed both signs).projection_dim=0) climbs monotonically:Clean convergence to ground truth → math is right; low-p noise is exactly what you'd expect from a 16-dim sketch of 768-dim per-layer matrices.
Multi-GPU (2× A40)
Sharded pipeline completes cleanly — exercises
build_kfac_projections,fair_distribute_by_cost, per-rank shard writes, multi-shard load in_apply_compressedandload_kfac_projections.dist.all_reduce(SUM)during covariance accumulation produces a slightly differentMper layout, and that small difference gets amplified by a low-p sketch. Rank correlation is what matters for retrieval and it's solid by p ≥ 32.Note for reviewers
There's one harmless
UserWarningabout a non-writable NumPy array inapply_hessian.py:259(the mmap is read-only and we never write to that tensor). Can fix with a.copy()if desired; left as-is to keep the diff focused on Louis's comments.🤖 Generated with Claude Code