feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3)#105
Conversation
Closes #77. Three algorithms whose memory profile cared about the dense user-item matrix move to scipy.sparse: - New _SparseMatrixBackedRecommender base in base.py — owns the CSR matrix, the user/item Index views, and the recommend() contract (unknown-user fast path, seen-item mask off the row's nonzero columns, top-n filter). Sibling of _MatrixBackedRecommender; ALS/BPR/ContentBased/TwoTowerCF stay on the dense base. - ItemKNN: sparse cosine_similarity on the transposed matrix, module-private _sparse_top_k_per_row helper for the diagonal-and-trim step, scoring is sim @ user_row.T. - UserKNN: sklearn.neighbors.NearestNeighbors with cosine metric so the full n_users x n_users similarity never gets materialized — the one thing that was forcing the goodbooks benchmark to subsample. - SVD: TruncatedSVD on the CSR matrix directly. It already accepted sparse; the dense pivot was the only thing keeping it from scaling. Local benchmark on MovieLens 100k (tests/benchmarks/test_fit_speed.py): Algorithm Phase 1.1 (dense) Phase 3 (sparse) SVD 46 ms 29 ms UserKNN 74 ms 33 ms ItemKNN 141 ms 85 ms The headline result is qualitative, though: UserKNN/ItemKNN/SVD can now run goodbooks-10k at full scale (53k users x 10k items). The subsample in scripts/benchmark_goodbooks.py remains only because ContentBased/HybridBook still use dense feature matrices — those are out of scope for this phase. Pickles of the affected classes from older versions won't load on the new code (the internal layout shifted from DataFrame to CSR). Pre-1.0; documented in CHANGELOG. ADR in docs/evolution/03-sparse-recommenders.md covers context, the options that were rejected (sparse-everywhere base migration, full sparse similarity for UserKNN, approximate NN, the sparse=True flag pattern), and what's deliberately not in scope (ContentBased remains dense; ALS is fast enough; ANN is a Phase-4+ candidate).
JohnJacob-coder
left a comment
There was a problem hiding this comment.
Reviewed Phase 3 in depth — I'd independently prototyped the same sparse base, so I checked yours against that and it's correct, and your UserKNN solution is better than what I had (I'd kept a dense user-user similarity; NearestNeighbors is the right call).
Verified locally: ruff / ruff format / mypy clean; pytest 133 passed, 0 failed (CI's 137 just reflects torch being present there). Benchmark spot-check on ML-100k matches within hardware variance and confirms the "faster even on small data" claim: SVD 37 ms, UserKNN 35 ms, ItemKNN 89 ms here — all well under the Phase 1.1 dense 46/74/141.
Correctness:
- ItemKNN — sparse cosine (
dense_output=False) +_sparse_top_k_per_row(top-k by value over the row's nonzeros) is equivalent to the old dense top-k; scoring issim @ user_row.T. Good. - UserKNN —
NearestNeighbors(cosine)finds top-k without materializing n_users², drops the self-neighbor (col 0) and converts distance→similarity correctly. - SVD — scores per-user from
user_factors[u] @ components_, so the dense predicted matrix is gone. Mathematically identical to the old reconstruction. - The test diff is purely additive (two sparse-representation pins); the existing recommend-output tests are unchanged and pass, so the public contract holds through the UserKNN approach change.
ADR is honest — it doesn't oversell NearestNeighbors (correctly notes brute-force is still O(users²·items), just bounded memory), defers ANN/FAISS as premature, and scopes ContentBased/ALS out with reasons. The 28-trillion-ops figure checks out.
Non-blocking: UserKNN.fit drops indices[:, 1:] assuming self is the distance-0 first neighbor. If duplicate-vector users tie at distance 0, self may not be column 0 — harmless today (self only scores seen items, which recommend masks), but worth a one-line comment so a future masking change can't quietly reintroduce self-recommendation.
LGTM. (v0.3.0 is the phase tag once this lands.)
Phase 3 — sparse-aware UserKNN / ItemKNN / SVD
Closes #77. Three algorithms whose memory profile cared about the dense user-item matrix move to scipy.sparse so the goodbooks-10k corpus runs at full scale instead of the 2,500-user subsample.
What's in the PR
New
_SparseMatrixBackedRecommenderbase inbase.py— owns the CSR matrix,pd.Indexviews for users and items, and therecommend()contract (unknown-user fast path, seen-item mask off the row's nonzero columns, top-nfilter). Sibling of_MatrixBackedRecommender; ALS / BPR / ContentBased / TwoTowerCF stay on the dense base because they aren't memory-constrained at the scales we target.ItemKNN — sparse cosine on the transposed matrix, then a private
_sparse_top_k_per_rowhelper does the diagonal-and-trim step. Scoring a user issim @ user_row.T.UserKNN —
sklearn.neighbors.NearestNeighbors(metric="cosine")so the fulln_users × n_userssimilarity never gets materialized. That's the one thing that was forcing the goodbooks benchmark to subsample.SVD —
TruncatedSVDon the CSR matrix directly. It already accepted sparse; the dense pivot was the only thing keeping it from scaling.Local benchmark on MovieLens 100k
tests/benchmarks/test_fit_speed.py— also faster on small data because the sparse path beats the dense one even before the memory win kicks in:The headline is qualitative: these three can now run goodbooks-10k at full scale (53k users × 10k items, ~6M interactions). The subsample in
scripts/benchmark_goodbooks.pyremains only becauseContentBased/HybridBookstill use dense feature matrices; that's a separate phase.ADR
docs/evolution/03-sparse-recommenders.mdcovers context, the options that were rejected (sparse-everywhere base migration, full sparse similarity for UserKNN, approximate NN, thesparse=Trueflag pattern), and what's deliberately out of scope.Verification
ruff check/ruff format --check/mypyclean.pytest— 137 passed, 1 skipped (the BPR kernel test skips without the compiled extension), 7 deselected (benchmarks).test_knn_stores_sparse_matrix,test_svd_stores_sparse_matrix) pin the sparse internal representation so a future change can't silently revert to dense.Backwards-incompatible
Pickles of
UserKNN/ItemKNN/SVDwritten before this release don't load on the new code — the internal matrix layout shifted from densepd.DataFrameto sparse CSR. Pre-1.0; documented in CHANGELOG.