feat(quant): implement NF4 quantization (fixes dangling --nf4 ImportError)#66
Merged
Conversation
convert.py exposed a user-facing `--nf4` flag (and an `--ultra` mode that enables it), and loader_utils read an `__nf4` tensor format, but both imported `squish.quant.nf4_quant` — a module that never existed. Any `squish convert --nf4 ...` crashed with an opaque ImportError, and the loader would crash on any NF4-format weight. The feature was wired end-to-end except for the codebook itself. Implement the missing module with the canonical QLoRA/bitsandbytes NF4 codebook (16 levels at the quantiles of a unit normal) and per-group absmax scaling: - quantize_nf4(W, group_size) -> (packed uint8 (n, d//2), scales f32 (n, n_groups)) Normalizes each group to [-1,1] by absmax, snaps to the nearest of 16 NF4 levels, nibble-packs two indices per byte (low nibble = even col, high = odd). - dequantize_nf4(packed, scales, group_size) -> f32 (n, d) Unpacks, looks up the codebook, broadcasts per-group scales. Matches the exact writer/reader contract: the caller picks a group_size that evenly divides d (and d is even), so packed is (n, d//2), scales is (n, d//gs), and the reader recovers gs exactly as (packed.shape[1]*2)//scales.shape[1] — no padding, no group-size mis-derivation. Verified end-to-end through convert.quantize_tensor(use_nf4=True) and loader_utils._dequantize_npy: ~0.086 relative error on Gaussian weights (typical for 4-bit). +17 tests (codebook, contract, round-trip, packing order, per-group scale independence, validation). Module census 106 -> 107. CI-mode suite: 4144 passed, 277 skipped. ruff clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01W8bTep4nw7ybFHhx7QjzMv
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the missing
nf4_quantmodule, completing the previously-dangling--nf4quantization feature.convert.pyexposes a user-facing--nf4flag (and an--ultramode that enables it), andloader_utilsreads an__nf4tensor format — but bothimport squish.quant.nf4_quant, a module that never existed in the repo's history. Sosquish convert --nf4 …crashed with an opaqueImportError, and the loader would crash on any NF4-format weight. The feature was wired end-to-end except for the codebook itself.What's implemented
Canonical QLoRA / bitsandbytes NF4 codebook (16 levels at the quantiles of a unit normal) with per-group absmax scaling:
quantize_nf4(W, group_size) -> (packed uint8 (n, d//2), scales f32 (n, n_groups))— normalize each group to[-1, 1]by absmax, snap to the nearest of the 16 NF4 levels, nibble-pack two indices per byte (low nibble = even col, high = odd).dequantize_nf4(packed, scales, group_size) -> f32 (n, d)— unpack, codebook lookup, broadcast per-group scales.Contract correctness (no group-size foot-gun)
The caller (
_pick_int4_group_size) always returns agroup_sizethat evenly dividesd, anddis even — sopackedis exactly(n, d//2),scalesis(n, d//gs), and the reader recoversgsexactly as(packed.shape[1]*2) // scales.shape[1]. No padding, and crucially no re-derivation mismatch of the kind fixed in #60/#65 (the round-trip audit that surfaced this feature gap). Validation rejects non-divisible / odd-column / 1-D inputs rather than silently corrupting.Validation
convert.quantize_tensor(use_nf4=True)(produces{__nf4, __s_nf4, __shape}) andloader_utils._dequantize_npy(derivesgs, dequantizes, reshapes) — ~0.086 relative error on Gaussian weights, typical for 4-bit.gsderivation, round-trip accuracy, nibble-packing order, per-group scale independence, zero-group safety, input validation.CI=1full suite: 4144 passed, 277 skipped;ruff check squish/clean.Context
This came out of the group-size audit after #60/#65: while confirming the remaining
// n_groupssites were safe, the NF4 reader path turned out to reference a non-existent module — a user-reachable crash. Per your call, implemented it (vs. removing the flag or failing fast).🤖 Generated with Claude Code
Generated by Claude Code