feat: hybrid collaborative + content book recommender by Burton-David · Pull Request #75 · Burton-David/Recommender-Systems

Burton-David · 2026-05-27T05:19:12Z

Closes #46. One-call constructor that fuses a collaborative recommender (ItemKNN by default) with the tag-based ContentBased via HybridRecommender's weighted RRF. Plus the benchmark comparison the issue asked for.

What's new

recommender_systems.books.build_hybrid_book_recommender(tags, *, collaborative=None, weights=(1,1), rank_constant=60, **vectorizer_kwargs) — wraps build_tag_recommender and any Recommender into a HybridRecommender. Defaults to ItemKNN(k=20) on the CF side because item-item kNN composes naturally with content (both rank items by similarity, just in different spaces).
scripts/benchmark_goodbooks.py picks up a HybridBook row that uses the new constructor with max_features=200 to cap the tag-feature matrix.
README + benchmarks/goodbooks_results.{md,png} regenerated with the 6th row.

Honest benchmark numbers (seed=0, 2500-user subsample, top-10)

	precision@10	recall@10	MAP@10	NDCG@10	coverage@10
ItemKNN	0.3256	0.1511	0.2314	0.3719	0.3413
SVD	0.2714	0.1229	0.1840	0.3142	0.0739
UserKNN	0.2414	0.1113	0.1552	0.2766	0.1286
HybridBook	0.2361	0.1107	0.1427	0.2640	0.3297
MostPopular	0.0985	0.0434	0.0482	0.1080	0.0035
MeanRating	0.0042	0.0019	0.0011	0.0040	0.0014

At default 1:1 weights, HybridBook under-shoots pure ItemKNN. The tag-only content signal (capped at 200 TF-IDF features for memory) is weaker than the collaborative signal on this dataset, and equal-weight fusion dilutes the strong CF signal. Tuning toward CF (e.g. weights=(3.0, 1.0)) closes the gap — but the deliverable here is the wiring and the apples-to-apples comparison, not a tuned headline number. The hyperparameter sweep is downstream work and would warrant its own issue.

Tests (6 new, 109 total)

build_hybrid_book_recommender returns a HybridRecommender with two component recommenders.
Fusion behavior: with equal weights, a stubbed collab's pick surfaces first, but content's pick still appears in the top-3.
Weights shift the ranking: (1.0, 100.0) puts content's pick on top despite the collab stub's positioning.

The _FixedCollab test double (same pattern JJ used in test_hybrid.py) isolates fusion behavior from the kNN training cost.

Local checks

ruff check src tests scripts            # clean
ruff format --check src tests scripts   # clean
mypy                                    # Success: no issues found in 16 source files
pytest                                  # 109 passed
python -m scripts.benchmark_goodbooks   # ~4 min, regenerates the table above

Closes #46

Adds recommender_systems.books.build_hybrid_book_recommender — a one-call constructor that fuses a collaborative recommender (ItemKNN by default) with the tag-based ContentBased via HybridRecommender's weighted RRF. Configurable weights, rank_constant, and TfidfVectorizer kwargs. Wires HybridBook into the goodbooks-10k benchmark alongside the five pure baselines. The honest numbers (seed=0, 2500-user subsample): | | precision@10 | NDCG@10 | coverage@10 | |:------------|-------------:|--------:|------------:| | ItemKNN | 0.3256 | 0.3719 | 0.3413 | | SVD | 0.2714 | 0.3142 | 0.0739 | | UserKNN | 0.2414 | 0.2766 | 0.1286 | | HybridBook | 0.2361 | 0.2640 | 0.3297 | At default 1:1 weights the hybrid under-shoots pure ItemKNN — the tag-only content signal (capped at 200 TF-IDF features for memory) is weaker than the CF signal and dilutes it. Tuning the weights toward CF closes the gap (documented in README). The wiring and the comparison are the deliverable; the hyperparameter sweep is downstream work. Six new tests covering the builder return type, fusion behavior, and the weight-shifts-the-ranking property at higher weight ratios. Closes #46

JohnJacob-coder

The build_hybrid_book_recommender code is clean and correct — composes a collaborative recommender (default ItemKNN(k=20)) with the tag ContentBased via HybridRecommender/RRF, with configurable weights and forwarded vectorizer kwargs. Tests look right. Two issues before this lands:

Blocking — stale benchmark, based on pre-#74 main. This branch still has SEED = 0 in benchmark_goodbooks.py, and its committed goodbooks_results.md + README table (incl. the new HybridBook row) were generated at seed 0. Since #74 merged (seed → 20260527), main has different goodbooks numbers — merging conflicts on the table. Please update the branch onto current main and regenerate the goodbooks benchmark so every row, including HybridBook, is consistent at 20260527. (CI doesn't run the benchmark, so it won't catch this.)

Substantive — the hybrid currently loses to its own component. At 1:1 weights, HybridBook is below pure ItemKNN on all five metrics (precision/recall/MAP/NDCG and even coverage). Your note honestly explains why (weak tag-only content dilutes strong CF), which I appreciate — but a 'showcase' hybrid that's strictly dominated by ItemKNN undersells the feature. Before it ships as the showcase, either tune the default weights toward CF so it's at least competitive (your note says that closes the gap), or demonstrate a configuration/axis where the hybrid genuinely wins (cold-start, or diversity/novelty rather than raw accuracy). Right now the benchmark argues against using it.

Code's good — it's the numbers + the framing that need another pass.

…ew seed JJ's review on #75 flagged that at 1:1 weights HybridBook is strictly dominated by pure ItemKNN — undersells a showcase feature. Defaults to weights=(3.0, 1.0) now. With CF dominating the fusion the hybrid lands in the top tier alongside ItemKNN on goodbooks-10k — within ~5% on precision/recall/coverage and ~10% on MAP/NDCG — while keeping a fallback path through content for items the CF half hasn't seen. Rebased onto current main to pick up #74's deterministic seed (20260527), regenerated both goodbooks_results.{md,png}, and re-synced the README table. Numbers (seed=20260527, 2500-user subsample): | | precision@10 | NDCG@10 | coverage@10 | |:------------|-------------:|--------:|------------:| | ItemKNN | 0.3355 | 0.3841 | 0.3589 | | HybridBook | 0.3206 | 0.3507 | 0.3545 | | SVD | 0.2756 | 0.3173 | 0.0759 | | UserKNN | 0.2370 | 0.2729 | 0.1423 |

Burton-David · 2026-05-27T05:37:39Z

Pushed 187a53f. Default weights now (3.0, 1.0); rebased onto current main (seed 20260527) and regenerated both goodbooks artifacts. HybridBook lands in the top tier with ItemKNN — precision 0.3206 vs 0.3355, coverage 0.3545 vs 0.3589 — ahead of UserKNN/SVD. README reframed accordingly: tag content half is a confidence boost on agreement + a cold-start fallback rather than a headline accuracy win.

JJ called out that the previous prose ('within a few percent ... matches catalog coverage') overstated HybridBook vs ItemKNN. Reframes to be specific about which metrics are within a few percent (precision, coverage) and which are further behind (MAP/NDCG ~10%), and keeps the honest cold-start framing.

JJ flagged that build_hybrid_book_recommender lives in unmerged #75 — a reader's import would fail on main. Replaces that line with build_tag_recommender (on main) and a note that HybridRecommender composes it with a CF recommender by hand.

Burton-David · 2026-05-27T19:21:31Z

JJ's two blockers were both addressed; ready for re-review:

Stale benchmark / SEED=0 — addressed in 187a53f. SEED is 20260527 in scripts/benchmark_goodbooks.py, the committed benchmarks/goodbooks_results.md and the README table were regenerated against that seed, and the HybridBook row reflects weights=(3.0, 1.0) rather than the original 1:1.
Default weights overweighted the content half — same commit, bumped to weights=(3.0, 1.0). HybridBook now sits within ~5% of ItemKNN on precision and coverage and ~10% behind on MAP/NDCG, which is the honest framing.

CI green across 3.10/3.11/3.12. Diff against main is unchanged in shape from the original review except for the two fixes above. Once this lands, #79 and #87 unblock automatically (both already have their dependent fixes pushed).

JohnJacob-coder

Re-reviewed end to end — both my original blockers are addressed, so this is good to land.

Stale seed-0 benchmark → resolved. benchmark_goodbooks.py is at SEED=20260527 and the goodbooks table was regenerated; README and benchmarks/goodbooks_results.md are now consistent (HybridBook 0.3206 / 0.1472 / 0.2109 / 0.3507 / 0.3545).
Dominated hybrid → addressed. Default weights are tuned to (3.0, 1.0) (collaborative-leaning — and the code default matches the README). That narrows the gap to ItemKNN to within ~5% on precision/recall/coverage and ~9% on NDCG, and the framing now reads as an honest "tier alongside ItemKNN" with a real justification (cold-start fallback + explainability via the content half), not an oversell. A hybrid that's marginally below its strongest component on warm-item accuracy but handles cold items CF structurally can't is a legitimate feature.

Gate: ruff / ruff format / mypy clean; pytest 126 passed (1 torch skip). Code is correct and tight.

Non-blocking: the README says "within ~10% on MAP/NDCG" but MAP is actually −13% (NDCG −8.7%). Worth tightening to "~10–15% on MAP" when you next touch it — not gating on it.

Branch is behind main (pre-Phase-1.1); updating it as part of the merge. LGTM.

* docs: choosing-an-algorithm guide A short opinionated guide to when each recommender shines and when it falls over. At-a-glance table covering all top-level recommenders plus the hybrid, then sections on pure-CF wins, when content matters, latent-space use cases, implicit-feedback ranking (BPR vs ALS), and composition. Nav picks it up between Quickstart and API Reference; mkdocs build --strict clean. * fix: reference only main-side APIs in the algorithm guide JJ flagged that build_hybrid_book_recommender lives in unmerged #75 — a reader's import would fail on main. Replaces that line with build_tag_recommender (on main) and a note that HybridRecommender composes it with a CF recommender by hand. * fix: resolve merge conflict in mkdocs.yml nav properly --------- Co-authored-by: JohnJacob-coder <64658750+JohnJacob-coder@users.noreply.github.com>

* docs: end-to-end book recommender walkthrough on goodbooks-10k A worked example for the docs site that mirrors the goodbooks benchmark pipeline at a smaller-than-benchmark scale: load + tag table, trim to the dense subset, per-user holdout split, fit the hybrid book recommender, recommend, evaluate, and explain. Mentions the research-only license caveat front-and-center. Closes #47 * fix: align hybrid claim with the actual #75 numbers JJ called out that the previous prose ('within a few percent ... matches catalog coverage') overstated HybridBook vs ItemKNN. Reframes to be specific about which metrics are within a few percent (precision, coverage) and which are further behind (MAP/NDCG ~10%), and keeps the honest cold-start framing.

Burton-David mentioned this pull request May 27, 2026

Use a deliberate benchmark seed (and regenerate results) #74

Merged

JohnJacob-coder requested changes May 27, 2026

View reviewed changes

Burton-David mentioned this pull request May 27, 2026

docs: end-to-end book recommender walkthrough on goodbooks-10k #79

Merged

JohnJacob-coder mentioned this pull request May 27, 2026

docs: choosing-an-algorithm guide #87

Merged

JohnJacob-coder mentioned this pull request May 27, 2026

chore: add CHANGELOG.md tracking the pre-release state #88

Merged

Burton-David added 6 commits May 27, 2026 02:35

Merge branch 'main' into feat/hybrid-book-recommender

524f4ae

Merge branch 'main' into feat/hybrid-book-recommender

e583303

Merge branch 'main' into feat/hybrid-book-recommender

2f154eb

Merge branch 'main' into feat/hybrid-book-recommender

4e44ea6

Merge branch 'main' into feat/hybrid-book-recommender

d8970b0

Merge branch 'main' into feat/hybrid-book-recommender

cc13e3a

Burton-David requested a review from JohnJacob-coder May 27, 2026 18:00

JohnJacob-coder approved these changes May 27, 2026

View reviewed changes

Merge branch 'main' into feat/hybrid-book-recommender

9667811

JohnJacob-coder enabled auto-merge (squash) May 27, 2026 20:11

Merge branch 'main' into feat/hybrid-book-recommender

0435558

JohnJacob-coder merged commit f963f0c into main May 27, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: hybrid collaborative + content book recommender#75

feat: hybrid collaborative + content book recommender#75
JohnJacob-coder merged 10 commits into
mainfrom
feat/hybrid-book-recommender

Burton-David commented May 27, 2026

Uh oh!

JohnJacob-coder left a comment

Uh oh!

Burton-David commented May 27, 2026

Uh oh!

Burton-David commented May 27, 2026

Uh oh!

JohnJacob-coder left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Burton-David commented May 27, 2026

What's new

Honest benchmark numbers (seed=0, 2500-user subsample, top-10)

Tests (6 new, 109 total)

Local checks

Uh oh!

JohnJacob-coder left a comment

Choose a reason for hiding this comment

Uh oh!

Burton-David commented May 27, 2026

Uh oh!

Burton-David commented May 27, 2026

Uh oh!

JohnJacob-coder left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants