Skip to content

feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3)#105

Merged
JohnJacob-coder merged 1 commit into
mainfrom
feat/phase-3-sparse-recommenders
May 27, 2026
Merged

feat: sparse-aware UserKNN / ItemKNN / SVD (phase 3)#105
JohnJacob-coder merged 1 commit into
mainfrom
feat/phase-3-sparse-recommenders

Conversation

@Burton-David

Copy link
Copy Markdown
Owner

Phase 3 — sparse-aware UserKNN / ItemKNN / SVD

Closes #77. Three algorithms whose memory profile cared about the dense user-item matrix move to scipy.sparse so the goodbooks-10k corpus runs at full scale instead of the 2,500-user subsample.

What's in the PR

  • New _SparseMatrixBackedRecommender base in base.py — owns the CSR matrix, pd.Index views for users and items, and the recommend() contract (unknown-user fast path, seen-item mask off the row's nonzero columns, top-n filter). Sibling of _MatrixBackedRecommender; ALS / BPR / ContentBased / TwoTowerCF stay on the dense base because they aren't memory-constrained at the scales we target.

  • ItemKNN — sparse cosine on the transposed matrix, then a private _sparse_top_k_per_row helper does the diagonal-and-trim step. Scoring a user is sim @ user_row.T.

  • UserKNNsklearn.neighbors.NearestNeighbors(metric="cosine") so the full n_users × n_users similarity never gets materialized. That's the one thing that was forcing the goodbooks benchmark to subsample.

  • SVDTruncatedSVD on the CSR matrix directly. It already accepted sparse; the dense pivot was the only thing keeping it from scaling.

Local benchmark on MovieLens 100k

tests/benchmarks/test_fit_speed.py — also faster on small data because the sparse path beats the dense one even before the memory win kicks in:

Algorithm Phase 1.1 (dense) Phase 3 (sparse)
SVD 46 ms 29 ms
UserKNN 74 ms 33 ms
ItemKNN 141 ms 85 ms

The headline is qualitative: these three can now run goodbooks-10k at full scale (53k users × 10k items, ~6M interactions). The subsample in scripts/benchmark_goodbooks.py remains only because ContentBased / HybridBook still use dense feature matrices; that's a separate phase.

ADR

docs/evolution/03-sparse-recommenders.md covers context, the options that were rejected (sparse-everywhere base migration, full sparse similarity for UserKNN, approximate NN, the sparse=True flag pattern), and what's deliberately out of scope.

Verification

  • ruff check / ruff format --check / mypy clean.
  • pytest — 137 passed, 1 skipped (the BPR kernel test skips without the compiled extension), 7 deselected (benchmarks).
  • Two new regression tests (test_knn_stores_sparse_matrix, test_svd_stores_sparse_matrix) pin the sparse internal representation so a future change can't silently revert to dense.

Backwards-incompatible

Pickles of UserKNN / ItemKNN / SVD written before this release don't load on the new code — the internal matrix layout shifted from dense pd.DataFrame to sparse CSR. Pre-1.0; documented in CHANGELOG.

Closes #77.

Three algorithms whose memory profile cared about the dense user-item
matrix move to scipy.sparse:

- New _SparseMatrixBackedRecommender base in base.py — owns the CSR
  matrix, the user/item Index views, and the recommend() contract
  (unknown-user fast path, seen-item mask off the row's nonzero
  columns, top-n filter). Sibling of _MatrixBackedRecommender;
  ALS/BPR/ContentBased/TwoTowerCF stay on the dense base.

- ItemKNN: sparse cosine_similarity on the transposed matrix,
  module-private _sparse_top_k_per_row helper for the diagonal-and-trim
  step, scoring is sim @ user_row.T.

- UserKNN: sklearn.neighbors.NearestNeighbors with cosine metric so
  the full n_users x n_users similarity never gets materialized — the
  one thing that was forcing the goodbooks benchmark to subsample.

- SVD: TruncatedSVD on the CSR matrix directly. It already accepted
  sparse; the dense pivot was the only thing keeping it from scaling.

Local benchmark on MovieLens 100k (tests/benchmarks/test_fit_speed.py):

  Algorithm    Phase 1.1 (dense)    Phase 3 (sparse)
  SVD          46 ms                29 ms
  UserKNN      74 ms                33 ms
  ItemKNN      141 ms               85 ms

The headline result is qualitative, though: UserKNN/ItemKNN/SVD can
now run goodbooks-10k at full scale (53k users x 10k items). The
subsample in scripts/benchmark_goodbooks.py remains only because
ContentBased/HybridBook still use dense feature matrices — those are
out of scope for this phase.

Pickles of the affected classes from older versions won't load on the
new code (the internal layout shifted from DataFrame to CSR). Pre-1.0;
documented in CHANGELOG.

ADR in docs/evolution/03-sparse-recommenders.md covers context, the
options that were rejected (sparse-everywhere base migration, full
sparse similarity for UserKNN, approximate NN, the sparse=True flag
pattern), and what's deliberately not in scope (ContentBased remains
dense; ALS is fast enough; ANN is a Phase-4+ candidate).

@JohnJacob-coder JohnJacob-coder left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed Phase 3 in depth — I'd independently prototyped the same sparse base, so I checked yours against that and it's correct, and your UserKNN solution is better than what I had (I'd kept a dense user-user similarity; NearestNeighbors is the right call).

Verified locally: ruff / ruff format / mypy clean; pytest 133 passed, 0 failed (CI's 137 just reflects torch being present there). Benchmark spot-check on ML-100k matches within hardware variance and confirms the "faster even on small data" claim: SVD 37 ms, UserKNN 35 ms, ItemKNN 89 ms here — all well under the Phase 1.1 dense 46/74/141.

Correctness:

  • ItemKNN — sparse cosine (dense_output=False) + _sparse_top_k_per_row (top-k by value over the row's nonzeros) is equivalent to the old dense top-k; scoring is sim @ user_row.T. Good.
  • UserKNN — NearestNeighbors(cosine) finds top-k without materializing n_users², drops the self-neighbor (col 0) and converts distance→similarity correctly.
  • SVD — scores per-user from user_factors[u] @ components_, so the dense predicted matrix is gone. Mathematically identical to the old reconstruction.
  • The test diff is purely additive (two sparse-representation pins); the existing recommend-output tests are unchanged and pass, so the public contract holds through the UserKNN approach change.

ADR is honest — it doesn't oversell NearestNeighbors (correctly notes brute-force is still O(users²·items), just bounded memory), defers ANN/FAISS as premature, and scopes ContentBased/ALS out with reasons. The 28-trillion-ops figure checks out.

Non-blocking: UserKNN.fit drops indices[:, 1:] assuming self is the distance-0 first neighbor. If duplicate-vector users tie at distance 0, self may not be column 0 — harmless today (self only scores seen items, which recommend masks), but worth a one-line comment so a future masking change can't quietly reintroduce self-recommendation.

LGTM. (v0.3.0 is the phase tag once this lands.)

@JohnJacob-coder JohnJacob-coder merged commit 1bda782 into main May 27, 2026
3 checks passed
@JohnJacob-coder JohnJacob-coder deleted the feat/phase-3-sparse-recommenders branch May 27, 2026 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: scipy-sparse user-item matrices to scale to the full corpus

2 participants