
feat(quant): implement NF4 quantization (fixes dangling --nf4 ImportError)#66

Merged
wesleyscholl merged 1 commit into main from claude/implement-nf4-quant on Jun 18, 2026


@konjoinfinity

Collaborator

Summary

Implements the missing nf4_quant module, completing the previously-dangling --nf4 quantization feature.

convert.py exposes a user-facing --nf4 flag (and an --ultra mode that enables it), and loader_utils reads an __nf4 tensor format — but both import squish.quant.nf4_quant, a module that never existed in the repo's history. So squish convert --nf4 … crashed with an opaque ImportError, and the loader would crash on any NF4-format weight. The feature was wired end-to-end except for the codebook itself.

What's implemented

Canonical QLoRA / bitsandbytes NF4 codebook (16 levels at the quantiles of a unit normal) with per-group absmax scaling:

  • quantize_nf4(W, group_size) -> (packed uint8 (n, d//2), scales f32 (n, n_groups)) — normalize each group to [-1, 1] by absmax, snap to the nearest of the 16 NF4 levels, nibble-pack two indices per byte (low nibble = even col, high = odd).
  • dequantize_nf4(packed, scales, group_size) -> f32 (n, d) — unpack, codebook lookup, broadcast per-group scales.
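The two functions above can be sketched in NumPy as follows. This is a minimal illustration of the described behavior, not the module's actual source: the level table is the NF4 codebook as published with QLoRA / bitsandbytes, and the validation messages and the handling of all-zero groups are assumptions.

```python
import numpy as np

# NF4 levels as published with QLoRA / bitsandbytes (16 values in [-1, 1]).
NF4_LEVELS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)


def quantize_nf4(W, group_size):
    """Return (packed uint8 (n, d//2), scales float32 (n, d//group_size))."""
    W = np.asarray(W, dtype=np.float32)
    if W.ndim != 2:
        raise ValueError("expected a 2-D weight matrix")
    n, d = W.shape
    if d % 2 or d % group_size:
        raise ValueError("d must be even and divisible by group_size")

    groups = W.reshape(n, d // group_size, group_size)
    scales = np.abs(groups).max(axis=2)                    # per-group absmax
    # Assumption: an all-zero group divides by 1 so it maps to the 0.0 level.
    safe = np.where(scales == 0, np.float32(1.0), scales)
    normed = groups / safe[:, :, None]                     # now in [-1, 1]

    # Snap every value to the index of its nearest codebook level.
    idx = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1)
    idx = idx.reshape(n, d).astype(np.uint8)

    # Low nibble = even column, high nibble = odd column.
    packed = (idx[:, 0::2] | (idx[:, 1::2] << 4)).astype(np.uint8)
    return packed, scales.astype(np.float32)


def dequantize_nf4(packed, scales, group_size):
    """Return float32 (n, d) reconstructed from packed indices and scales."""
    n, half = packed.shape
    idx = np.empty((n, half * 2), dtype=np.uint8)
    idx[:, 0::2] = packed & 0x0F                           # even columns
    idx[:, 1::2] = packed >> 4                             # odd columns
    # Codebook lookup, then broadcast each group's scale over its columns.
    return NF4_LEVELS[idx] * np.repeat(scales, group_size, axis=1)
```

Comparing every value against all 16 levels allocates a temporary 16 times the size of the input, which keeps the sketch short but is not how a memory-conscious implementation would do it.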

Contract correctness (no group-size foot-gun)

The caller (_pick_int4_group_size) always returns a group_size that evenly divides d, and d is even — so packed is exactly (n, d//2), scales is (n, d//gs), and the reader recovers gs exactly as (packed.shape[1]*2) // scales.shape[1]. No padding, and crucially no re-derivation mismatch of the kind fixed in #60/#65 (the round-trip audit that surfaced this feature gap). Validation rejects non-divisible / odd-column / 1-D inputs rather than silently corrupting.
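The shape arithmetic behind that contract can be checked with a short sketch. The helper name `derive_group_size` is hypothetical, and the padded layout in the second half is an invented illustration of how a re-derivation mismatch could arise in general; it is not a description of what #60/#65 actually fixed.

```python
import numpy as np


def derive_group_size(packed, scales):
    """Recover gs from array shapes alone (hypothetical helper name)."""
    d = packed.shape[1] * 2          # two 4-bit indices per packed byte
    n_groups = scales.shape[1]
    if d % n_groups:
        raise ValueError("column count is not a multiple of the group count")
    return d // n_groups


# Divisible case: d = 256, gs = 64, so the reader recovers gs exactly.
n, d, gs = 4, 256, 64
packed = np.zeros((n, d // 2), dtype=np.uint8)
scales = np.ones((n, d // gs), dtype=np.float32)
recovered = derive_group_size(packed, scales)

# Hypothetical padded writer: d = 96 with gs = 64 needs ceil(96 / 64) = 2
# groups, so a reader that divides shapes would derive 96 // 2 = 48, not 64.
d_bad, gs_bad = 96, 64
n_groups_bad = -(-d_bad // gs_bad)   # ceiling division
mis_derived = d_bad // n_groups_bad
```

Requiring `group_size` to divide `d` on the writer side is what makes the shape-only derivation on the reader side safe, since no separate group-size field has to be stored alongside the tensor.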

Validation

  • Verified end-to-end through convert.quantize_tensor(use_nf4=True) (produces {__nf4, __s_nf4, __shape}) and loader_utils._dequantize_npy (derives gs, dequantizes, reshapes) — ~0.086 relative error on Gaussian weights, typical for 4-bit.
  • +17 tests: codebook properties, writer/reader shape contract, reader gs derivation, round-trip accuracy, nibble-packing order, per-group scale independence, zero-group safety, input validation.
  • Module census 106 → 107.
  • CI=1 full suite: 4144 passed, 277 skipped; ruff check squish/ clean.
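For readers who want to reproduce a round-trip accuracy check of this kind, here is a minimal sketch. It assumes "relative error" means the Frobenius norm of the difference divided by the Frobenius norm of the original, which the description does not spell out, and it skips nibble packing because packing is lossless and does not affect the error. The function names, matrix size, and seed are all illustrative.

```python
import numpy as np

# NF4 levels as published with QLoRA / bitsandbytes.
NF4_LEVELS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)


def nf4_round_trip(W, group_size):
    """Quantize to NF4 levels and reconstruct, without packing."""
    n, d = W.shape
    g = W.reshape(n, d // group_size, group_size)
    s = np.abs(g).max(axis=2, keepdims=True)       # per-group absmax
    s = np.where(s == 0, np.float32(1.0), s)       # guard all-zero groups
    idx = np.abs((g / s)[..., None] - NF4_LEVELS).argmin(axis=-1)
    return (NF4_LEVELS[idx] * s).reshape(n, d)


def relative_error(W, W_hat):
    """Assumed metric: ||W - W_hat||_F / ||W||_F."""
    return float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256)).astype(np.float32)
err = relative_error(W, nf4_round_trip(W, group_size=64))
print(f"relative error: {err:.3f}")
```

The exact figure depends on the matrix shape, group size, and seed, so it should not be expected to match the number reported above digit for digit.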

Context

This came out of the group-size audit after #60/#65: while confirming the remaining // n_groups sites were safe, the NF4 reader path turned out to reference a non-existent module, which is a user-reachable crash. Per your call, I implemented the module rather than removing the flag or failing fast.

🤖 Generated with Claude Code


convert.py exposed a user-facing `--nf4` flag (and an `--ultra` mode that
enables it), and loader_utils read an `__nf4` tensor format, but both imported
`squish.quant.nf4_quant` — a module that never existed. Any `squish convert
--nf4 ...` crashed with an opaque ImportError, and the loader would crash on any
NF4-format weight. The feature was wired end-to-end except for the codebook
itself.

Implement the missing module with the canonical QLoRA/bitsandbytes NF4 codebook
(16 levels at the quantiles of a unit normal) and per-group absmax scaling:

- quantize_nf4(W, group_size) -> (packed uint8 (n, d//2), scales f32 (n, n_groups))
  Normalizes each group to [-1,1] by absmax, snaps to the nearest of 16 NF4
  levels, nibble-packs two indices per byte (low nibble = even col, high = odd).
- dequantize_nf4(packed, scales, group_size) -> f32 (n, d)
  Unpacks, looks up the codebook, broadcasts per-group scales.

Matches the exact writer/reader contract: the caller picks a group_size that
evenly divides d (and d is even), so packed is (n, d//2), scales is (n, d//gs),
and the reader recovers gs exactly as (packed.shape[1]*2)//scales.shape[1] — no
padding, no group-size mis-derivation.

Verified end-to-end through convert.quantize_tensor(use_nf4=True) and
loader_utils._dequantize_npy: ~0.086 relative error on Gaussian weights (typical
for 4-bit). +17 tests (codebook, contract, round-trip, packing order, per-group
scale independence, validation). Module census 106 -> 107.

CI-mode suite: 4144 passed, 277 skipped. ruff clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01W8bTep4nw7ybFHhx7QjzMv
@wesleyscholl wesleyscholl marked this pull request as ready for review June 18, 2026 13:45
@wesleyscholl wesleyscholl merged commit dc64115 into main Jun 18, 2026
18 checks passed
@wesleyscholl wesleyscholl deleted the claude/implement-nf4-quant branch June 18, 2026 13:45
