feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3) by Burton-David · Pull Request #105 · Burton-David/Recommender-Systems

Burton-David · 2026-05-27T21:32:21Z

Phase 3 — sparse-aware UserKNN / ItemKNN / SVD

Closes #77. Three algorithms whose memory profile cared about the dense user-item matrix move to scipy.sparse so the goodbooks-10k corpus runs at full scale instead of the 2,500-user subsample.

What's in the PR

New _SparseMatrixBackedRecommender base in base.py — owns the CSR matrix, pd.Index views for users and items, and the recommend() contract (unknown-user fast path, seen-item mask off the row's nonzero columns, top-n filter). Sibling of _MatrixBackedRecommender; ALS / BPR / ContentBased / TwoTowerCF stay on the dense base because they aren't memory-constrained at the scales we target.
ItemKNN — sparse cosine on the transposed matrix, then a private _sparse_top_k_per_row helper does the diagonal-and-trim step. Scoring a user is sim @ user_row.T.
UserKNN — sklearn.neighbors.NearestNeighbors(metric="cosine") so the full n_users × n_users similarity never gets materialized. That's the one thing that was forcing the goodbooks benchmark to subsample.
SVD — TruncatedSVD on the CSR matrix directly. It already accepted sparse; the dense pivot was the only thing keeping it from scaling.

Local benchmark on MovieLens 100k

tests/benchmarks/test_fit_speed.py — also faster on small data because the sparse path beats the dense one even before the memory win kicks in:

Algorithm	Phase 1.1 (dense)	Phase 3 (sparse)
SVD	46 ms	29 ms
UserKNN	74 ms	33 ms
ItemKNN	141 ms	85 ms

The headline is qualitative: these three can now run goodbooks-10k at full scale (53k users × 10k items, ~6M interactions). The subsample in scripts/benchmark_goodbooks.py remains only because ContentBased / HybridBook still use dense feature matrices; that's a separate phase.

ADR

docs/evolution/03-sparse-recommenders.md covers context, the options that were rejected (sparse-everywhere base migration, full sparse similarity for UserKNN, approximate NN, the sparse=True flag pattern), and what's deliberately out of scope.

Verification

ruff check / ruff format --check / mypy clean.
pytest — 137 passed, 1 skipped (the BPR kernel test skips without the compiled extension), 7 deselected (benchmarks).
Two new regression tests (test_knn_stores_sparse_matrix, test_svd_stores_sparse_matrix) pin the sparse internal representation so a future change can't silently revert to dense.

Backwards-incompatible

Pickles of UserKNN / ItemKNN / SVD written before this release don't load on the new code — the internal matrix layout shifted from dense pd.DataFrame to sparse CSR. Pre-1.0; documented in CHANGELOG.

Closes #77. Three algorithms whose memory profile cared about the dense user-item matrix move to scipy.sparse: - New _SparseMatrixBackedRecommender base in base.py — owns the CSR matrix, the user/item Index views, and the recommend() contract (unknown-user fast path, seen-item mask off the row's nonzero columns, top-n filter). Sibling of _MatrixBackedRecommender; ALS/BPR/ContentBased/TwoTowerCF stay on the dense base. - ItemKNN: sparse cosine_similarity on the transposed matrix, module-private _sparse_top_k_per_row helper for the diagonal-and-trim step, scoring is sim @ user_row.T. - UserKNN: sklearn.neighbors.NearestNeighbors with cosine metric so the full n_users x n_users similarity never gets materialized — the one thing that was forcing the goodbooks benchmark to subsample. - SVD: TruncatedSVD on the CSR matrix directly. It already accepted sparse; the dense pivot was the only thing keeping it from scaling. Local benchmark on MovieLens 100k (tests/benchmarks/test_fit_speed.py): Algorithm Phase 1.1 (dense) Phase 3 (sparse) SVD 46 ms 29 ms UserKNN 74 ms 33 ms ItemKNN 141 ms 85 ms The headline result is qualitative, though: UserKNN/ItemKNN/SVD can now run goodbooks-10k at full scale (53k users x 10k items). The subsample in scripts/benchmark_goodbooks.py remains only because ContentBased/HybridBook still use dense feature matrices — those are out of scope for this phase. Pickles of the affected classes from older versions won't load on the new code (the internal layout shifted from DataFrame to CSR). Pre-1.0; documented in CHANGELOG. ADR in docs/evolution/03-sparse-recommenders.md covers context, the options that were rejected (sparse-everywhere base migration, full sparse similarity for UserKNN, approximate NN, the sparse=True flag pattern), and what's deliberately not in scope (ContentBased remains dense; ALS is fast enough; ANN is a Phase-4+ candidate).

JohnJacob-coder

Reviewed Phase 3 in depth — I'd independently prototyped the same sparse base, so I checked yours against that and it's correct, and your UserKNN solution is better than what I had (I'd kept a dense user-user similarity; NearestNeighbors is the right call).

Verified locally: ruff / ruff format / mypy clean; pytest 133 passed, 0 failed (CI's 137 just reflects torch being present there). Benchmark spot-check on ML-100k matches within hardware variance and confirms the "faster even on small data" claim: SVD 37 ms, UserKNN 35 ms, ItemKNN 89 ms here — all well under the Phase 1.1 dense 46/74/141.

Correctness:

ItemKNN — sparse cosine (dense_output=False) + _sparse_top_k_per_row (top-k by value over the row's nonzeros) is equivalent to the old dense top-k; scoring is sim @ user_row.T. Good.
UserKNN — NearestNeighbors(cosine) finds top-k without materializing n_users², drops the self-neighbor (col 0) and converts distance→similarity correctly.
SVD — scores per-user from user_factors[u] @ components_, so the dense predicted matrix is gone. Mathematically identical to the old reconstruction.
The test diff is purely additive (two sparse-representation pins); the existing recommend-output tests are unchanged and pass, so the public contract holds through the UserKNN approach change.

ADR is honest — it doesn't oversell NearestNeighbors (correctly notes brute-force is still O(users²·items), just bounded memory), defers ANN/FAISS as premature, and scopes ContentBased/ALS out with reasons. The 28-trillion-ops figure checks out.

Non-blocking: UserKNN.fit drops indices[:, 1:] assuming self is the distance-0 first neighbor. If duplicate-vector users tie at distance 0, self may not be column 0 — harmless today (self only scores seen items, which recommend masks), but worth a one-line comment so a future masking change can't quietly reintroduce self-recommendation.

LGTM. (v0.3.0 is the phase tag once this lands.)

Burton-David mentioned this pull request May 27, 2026

docs: ADR for phase 4 — TwoTowerCF stays in PyTorch (deliberate) #106

Merged

JohnJacob-coder mentioned this pull request May 27, 2026

feat: scipy-sparse user-item matrices to scale to the full corpus #77

Closed

JohnJacob-coder approved these changes May 27, 2026

View reviewed changes

JohnJacob-coder merged commit 1bda782 into main May 27, 2026
3 checks passed

JohnJacob-coder deleted the feat/phase-3-sparse-recommenders branch May 27, 2026 22:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3)#105

feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3)#105
JohnJacob-coder merged 1 commit into
mainfrom
feat/phase-3-sparse-recommenders

Burton-David commented May 27, 2026

Uh oh!

JohnJacob-coder left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Burton-David commented May 27, 2026

Phase 3 — sparse-aware UserKNN / ItemKNN / SVD

What's in the PR

Local benchmark on MovieLens 100k

ADR

Verification

Backwards-incompatible

Uh oh!

JohnJacob-coder left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants