Skip to content

P-SIFT: precondition-and-sketch via modified projections (Louis's review of #275)#288

Open
jammastergirish wants to merge 8 commits into
feat/first-class-kfac-build-supportfrom
feat/p-sift-modified-projections
Open

P-SIFT: precondition-and-sketch via modified projections (Louis's review of #275)#288
jammastergirish wants to merge 8 commits into
feat/first-class-kfac-build-supportfrom
feat/p-sift-modified-projections

Conversation

@jammastergirish

@jammastergirish jammastergirish commented May 25, 2026

Copy link
Copy Markdown
Collaborator

Stacked on top of #275 — addresses @LouisYRYJ' comments from last week's discussion.

What's different from #275

Our discussion and the inline comments asked for the precondition-and-sketch approach instead: compose the random projection with H^{-1/2} into a single saved matrix M = R · cov^{-1/2}, then use M wherever the gradient collector used to sample R. Query and training gradients both go through M, so <ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q> without ever applying H⁻¹ in low-dim space.

This is as per

image

and

image

Comment-to-code map

Pulled from comments in #275

Louis's comment Where it lives
Top-level: "save the eigenvalues of A and S" eigenvectors.py now writes eigval_{activation,gradient}_sharded/ next to the existing eigvec dirs
Top-level: "compute (damped) inverses then random-project the curvature itself" apply_hessian.py:build_kfac_projections forms M = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀ per (layer, side)
Top-level: "works for any method, not just kfac" Method-agnostic — build_kfac_projections reads eigen_*/eigval_* which kfac/tkfac/shampoo all produce via the shared hessian_worker
Top-level: "EV correction will not work" pipeline.py raises immediately on projection_dim > 0 + ev_correction=True; secondary guard in compute_ivhp_sharded
apply_hessian.py:38 — "TODO: H⁻¹/² multiplications" D = (E + damp).clamp_min(...).rsqrt()M = R @ (Q * D) @ Q.T
pipeline.py:43 — "throw error compression + ev_correction" Top-of-pipeline guard
pipeline.py:43 — "also works for shampoo" Method-agnostic; same answer as above
pipeline.py:115 — "Step 3.5: Compute R @ H⁻¹/²" Labeled Step 3.5/4 print + _timed block + dedicated build_projections_worker
pipeline.py:115 — "right and left projection separately" SIDE_TO_COV = {"left": "gradient", "right": "activation"}; build loop produces distinct M_left (with S^{-1/2}) and M_right (with A^{-1/2})
pipeline.py:115 — "save to disk" save_file(..., projection_{side}_sharded/shard_{rank}.safetensors)
pipeline.py:126 — "find where random projections multiply with grads, add a config to use the saved ones" New IndexConfig.kfac_projection_path; GradientCollector.setup() calls load_kfac_projections which pre-populates processor._projection_matrices so HookCollectorBase.projection() returns saved M, not a fresh R
Lucia: "too many initialisms (pt)" Renamed throughout; only "pt" remaining is the safetensors framework arg
Bias support Not addressed — Louis flagged "not load-bearing" in the earlier thread

Verification

Smoke config: gpt2, wikitext train[:128] as the K-FAC fitting set, test[:4] as the query, ev_correction=False.

Single GPU

  • 13/13 K-FAC tests pass (pytest tests/ekfac_tests/).
  • End-to-end pipeline runs clean; saves eigenvectors, eigenvalues, M_left, M_right; applies M·G·Mᵀ; produces 256 well-formed scores (no NaN, heavy-tailed both signs).
  • Spearman ρ against legacy uncompressed (projection_dim=0) climbs monotonically:
p ρ vs legacy
16 0.69
32 0.83
64 0.95
128 0.99

Clean convergence to ground truth → math is right; low-p noise is exactly what you'd expect from a 16-dim sketch of 768-dim per-layer matrices.

Multi-GPU (2× A40)
Sharded pipeline completes cleanly — exercises build_kfac_projections, fair_distribute_by_cost, per-rank shard writes, multi-shard load in _apply_compressed and load_kfac_projections.

1-GPU vs legacy 2-GPU vs legacy 1-GPU vs 2-GPU
Legacy (p=0) 0.997
p=16 0.69 0.48 0.73
p=32 0.83 0.89 0.91
p=64 0.95 0.96 0.96
p=128 0.99 0.97 0.98
  • Legacy 1-GPU vs 2-GPU = 0.997 confirms the distributed aggregation pipeline is sound independent of P-SIFT.
  • Both P-SIFT runs converge to ground truth as p grows; they also converge to each other.
  • Bit-identical scores across GPU counts are not guaranteed: FP non-associativity in dist.all_reduce(SUM) during covariance accumulation produces a slightly different M per layout, and that small difference gets amplified by a low-p sketch. Rank correlation is what matters for retrieval and it's solid by p ≥ 32.

Note for reviewers

There's one harmless UserWarning about a non-writable NumPy array in apply_hessian.py:259 (the mmap is read-only and we never write to that tensor). Can fix with a .copy() if desired; left as-is to keep the diff focused on Louis's comments.

🤖 Generated with Claude Code

@jammastergirish jammastergirish marked this pull request as draft May 25, 2026 22:04
@jammastergirish jammastergirish requested a review from LouisYRYJ May 26, 2026 17:51
@jammastergirish jammastergirish marked this pull request as ready for review May 26, 2026 17:51
@luciaquirke

Copy link
Copy Markdown
Collaborator

Can fix with a .copy() if desired

Nah not an issue

Comment thread bergson/collector/gradient_collectors.py Outdated
Comment thread pyproject.toml
"torch>=2.5,<2.10",
# torchvision/torchaudio aren't used by bergson directly, but transformers'
# image_utils does an unconditional `from torchvision.io import ...` at
# import time when torchvision is present. RunPod (and most CUDA base

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should reference RunPod in particular in our library

Comment thread .pre-commit-config.yaml
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: 'v0.15.13'
rev: 'v0.15.14'
hooks:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change this here?

Comment thread bergson/magic/cli.py
g,
"sum",
dist.distributed_c10d._get_default_group(),
k: wait_tensor(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes intentional? We might need to rebase

@LouisYRYJ

Copy link
Copy Markdown
Contributor

Thanks @jammastergirish!
Looks all correct at first glance, maybe you can rebase before I took a close look?

@jammastergirish

Copy link
Copy Markdown
Collaborator Author

Thanks @jammastergirish! Looks all correct at first glance, maybe you can rebase before I took a close look?

Oh sorry. I thought I had. Will do shortly.

jammastergirish and others added 8 commits June 2, 2026 16:27
Per Louis Jaburi's review on #275: compose the random projection with
H^{-1/2} into a single saved matrix M = R · cov^{-1/2}, then use M
wherever the gradient collector used to sample R. Query and training
gradients both go through M, so <ĝ_z, ĝ_q> ≈ <g_z, H⁻¹ g_q> without
ever applying H⁻¹ in low-dim space.

- eigenvectors.py: persist per-side eigenvalues to eigval_{activation,
  gradient}_sharded/ alongside the existing eigenvectors.
- apply_hessian.py: build_kfac_projections builds
  M = R · Q · diag((E + λ·mean(E))^{-1/2}) · Qᵀ per (layer, side) and
  writes projection_{left,right}_sharded/. _apply_compressed applies
  ĝ_q = M_l · G_q · M_rᵀ. Legacy rotate-divide-rotate path is unchanged.
  Early guard rejects ev_correction + compression (the joint S⊗A
  eigenvalue correction breaks the Kronecker structure).
- config.py + gradient_collectors.py: new IndexConfig.kfac_projection_path
  pre-populates processor._projection_matrices from disk in
  GradientCollector.setup(), so HookCollectorBase.projection() returns
  the saved M instead of sampling a fresh R at hook time.
- pipeline.py: visible Step 3.5 builds M and projects the query;
  Step 4 sets kfac_projection_path so the training-side collector uses
  the same M. Top-level guard fails fast on projection_dim > 0 +
  ev_correction=True before any work runs.

Untested on GPU (Mac, no CUDA); smoke yaml at runs/p_sift_smoke.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
create_projection_matrix returns unit-norm rows, so E[RᵀR] = (p/d)·I
rather than the identity an unbiased JL sketch requires. For the single
global projection this p/d factor is a harmless global constant, but
P-SIFT applies the projection per-side and per-layer with differing d,
so the score picked up a per-layer p²/(d_left·d_right) reweighting that
varies ~4x across gpt2 layer shapes and corrupted ranking vs the exact
projection_dim=0 path.

Rescale R by √(d/p) in build_kfac_projections to restore E[RᵀR] = I.

Verified by projection_dim sweep against the exact (legacy) scores:
Spearman ρ vs legacy went from a declining 0.60/0.23/0.30/0.13 (bug) to
a climbing 0.64/0.85/0.985/0.994 for p = 16/32/64/128, confirming the
sketch now converges to H⁻¹ as p grows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deferred import inside GradientCollector.setup() wasn't avoiding a circular import (nothing imports gradient_collectors, and apply_hessian's import closure never reaches it), so hoist it alongside the other bergson imports per the project convention of imports-at-top.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jammastergirish jammastergirish force-pushed the feat/p-sift-modified-projections branch from c99811e to 208616f Compare June 2, 2026 23:47
Comment on lines +75 to +76
rank = dist.get_rank() if dist.is_initialized() else 0
world_size = dist.get_world_size() if dist.is_initialized() else 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure you use the correct rank and world size (important for multi node setups)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(it seems like they should be the global ones?)


names = list(side_dims["left"].keys())

per_layer_dim = {n: max(side_dims["left"][n], side_dims["right"][n]) for n in names}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is somewhat confusing and I am not sure I understand what this is doing. If you want to fairly distribute by cost, then looking at the max is not necessarily correct?

Comment on lines +133 to +138
# create_projection_matrix returns unit-norm rows, so E[RᵀR] =
# (p/d)·I. An unbiased sketch needs E[RᵀR] = I, so rescale by
# √(d/p); otherwise each side carries a p/d factor and the score
# picks up a per-layer p²/(d_left·d_right) reweighting that
# corrupts ranking relative to the exact (projection_dim=0) path.
R = R * (d / projection_dim) ** 0.5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how relevant that is, we look at relative rankings only anyway

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we can leave it there

grad_buffer.flush()
self.logger.info(f"Saved sketched IVHP gradients to {self.cfg.run_path}")

def _apply_legacy(self):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this legacy?

index_cfg.distributed,
)
else:
# ── Step 3 (legacy): Apply inverse Hessian via rotate-divide-rotate ──

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all these are not legacy, maybe I am missing something here

Comment thread bergson/config.py
processor_path: str = ""
"""Path to a precomputed processor."""

kfac_projection_path: str = ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we should even make this configurable? Can't imagine a situation where we would need to change it

@LouisYRYJ

Copy link
Copy Markdown
Contributor

Left a few comments, I think we are almost there!

We should also for now raise Error when doing lora + random projections because in that case one side should not be compressed, but we don't handle that right now. Would be a good follow up PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants