A top-N recommender system built on MovieLens 1M (1M ratings, 6,040 users, 3,706 movies) that implements and compares three approaches — collaborative filtering, content-based filtering, and a hybrid fusion model — against a popularity baseline.
The goal is personalised movie ranking: given a user's rating history, return a ranked list of 10 unseen movies they are likely to enjoy. The system is evaluated end-to-end with standard retrieval metrics (Precision@10, Recall@10, NDCG@10, Coverage) over 4,500 users.
ratings.csv ──► preprocessing ──► train / test split (80/20, per-user)
                                          │
                      ┌───────────────────┤
                      │                   │
                  SVD model         TF-IDF model
                  (collab.)           (content)
                      │                   │
                      └────────┬──────────┘
                               │
                         Hybrid fusion
                       (weighted blend)
                               │
                     Ranking + Evaluation
Popularity baseline
Recommends the globally most-rated movies to every user. No personalisation. Included as a sanity check — any meaningful recommender should beat it.
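A minimal sketch of such a baseline, assuming the training ratings are available as a list of `(user_id, movie_id, rating)` tuples. The function name and data layout are illustrative and not taken from the project code.

```python
from collections import Counter


def popularity_top_n(train_ratings, user_id, n=10):
    """Return the n most-rated movies that user_id has not rated yet.

    train_ratings: list of (user_id, movie_id, rating) tuples (assumed layout).
    """
    # Count how many ratings each movie received in the training data.
    counts = Counter(movie for _, movie, _ in train_ratings)
    # Movies the user has already rated are never recommended.
    seen = {movie for user, movie, _ in train_ratings if user == user_id}
    ranked = [movie for movie, _ in counts.most_common() if movie not in seen]
    return ranked[:n]


ratings = [
    (1, "A", 5), (2, "A", 4), (3, "A", 3),
    (1, "B", 4), (2, "B", 2),
    (3, "C", 5),
]
print(popularity_top_n(ratings, user_id=3, n=2))
```

Every user gets the same ranking apart from the movies they have already rated, which is why the baseline's coverage stays close to K divided by the catalogue size.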
SVD — Collaborative Filtering
Factorises the mean-centred user-item matrix using Truncated SVD (R ≈ U · Σ · Vᵀ). The k latent factors capture implicit taste dimensions without explicit labels. Missing values are imputed with a shrinkage-regularised baseline:
imputed(u, i) = μ + (μ_u − μ) + (μ_i_shrunk − μ)
where μ_i_shrunk = (n_i · μ_i + λ · μ) / (n_i + λ)
With λ=25, a movie needs ~25 ratings before its own mean outweighs the global mean. This prevents obscure 5-star movies (1–2 ratings) from dominating recommendations. Movies with fewer than 20 training ratings are excluded from inference for the same reason — their matrix columns are almost entirely imputed signal.
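The imputation and factorisation steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code in `preprocessing.py` or `collaborative.py`: the matrix is dense with `NaN` for missing ratings, centring is done per user, and the function names are hypothetical.

```python
import numpy as np


def impute_with_shrunk_baseline(R, lam=25.0):
    """Fill NaN cells of a users x movies matrix with mu_u + mu_i_shrunk - mu."""
    R = np.asarray(R, dtype=float)
    observed = ~np.isnan(R)
    zeros_for_missing = np.where(observed, R, 0.0)
    mu = zeros_for_missing.sum() / observed.sum()  # global mean
    n_u = observed.sum(axis=1)  # ratings per user
    n_i = observed.sum(axis=0)  # ratings per movie
    # User means; users without ratings fall back to the global mean.
    mu_u = np.where(n_u > 0, zeros_for_missing.sum(axis=1) / np.maximum(n_u, 1), mu)
    # (n_i * mu_i + lam * mu) / (n_i + lam), using the rating sum for n_i * mu_i.
    mu_i_shrunk = (zeros_for_missing.sum(axis=0) + lam * mu) / (n_i + lam)
    baseline = mu + (mu_u[:, None] - mu) + (mu_i_shrunk[None, :] - mu)
    return np.where(observed, R, baseline)


def low_rank_scores(R_filled, k):
    """Rank-k reconstruction of the user-mean-centred matrix (centring is assumed)."""
    user_means = R_filled.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(R_filled - user_means, full_matrices=False)
    return user_means + (U[:, :k] * s[:k]) @ Vt[:k, :]


R = np.array([[5.0, np.nan],
              [3.0, 1.0]])
filled = impute_with_shrunk_baseline(R, lam=1.0)
scores = low_rank_scores(filled, k=1)
```

To see the shrinkage at work, take λ=25 and an illustrative global mean of 3.6: a movie with two 5-star ratings gets μ_i_shrunk = (2 · 5 + 25 · 3.6) / 27 ≈ 3.70 rather than 5.0.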
Content-Based Filtering
Represents each movie as a TF-IDF vector of its pipe-separated genre tags. Rare genres (Film-Noir, Western) receive higher weight than common ones (Drama, Comedy). A user taste profile is the rating-weighted mean of the vectors of movies they rated ≥ 3.5 stars. Candidates are ranked by cosine similarity to that profile.
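The sketch below walks through the same idea with NumPy only. It is an assumption-laden illustration rather than the contents of `content.py`: the smoothed IDF formula, the L2 row normalisation, and the function names are choices made for this example.

```python
import numpy as np


def genre_tfidf(genre_strings):
    """Build L2-normalised TF-IDF rows from pipe-separated genre strings."""
    tokens = [g.split("|") for g in genre_strings]
    vocab = sorted({t for row in tokens for t in row})
    col = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(tokens), len(vocab)))
    for i, row in enumerate(tokens):
        for t in row:
            tf[i, col[t]] += 1.0
    df = (tf > 0).sum(axis=0)  # number of movies carrying each genre
    idf = np.log((1 + len(tokens)) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
    X = tf * idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return X, vocab


def rank_by_profile(X, rated_idx, ratings, threshold=3.5):
    """Rank unrated movies by cosine similarity to the user's taste profile."""
    rated_idx = np.asarray(rated_idx)
    ratings = np.asarray(ratings, dtype=float)
    liked = ratings >= threshold
    if not liked.any():
        return []  # no liked movies, so no profile can be built
    # Rating-weighted mean of the liked movies' vectors.
    profile = np.average(X[rated_idx[liked]], axis=0, weights=ratings[liked])
    # Rows of X are unit length, so dividing by the profile norm gives cosine similarity.
    sims = X @ profile / max(np.linalg.norm(profile), 1e-12)
    seen = set(rated_idx.tolist())
    order = np.argsort(-sims, kind="stable")
    return [int(i) for i in order if int(i) not in seen]


genres = ["Comedy", "Comedy|Romance", "Comedy|Drama", "Drama|Western", "Drama"]
X, vocab = genre_tfidf(genres)
ranked = rank_by_profile(X, rated_idx=[3, 0], ratings=[5.0, 2.0])
```

In this toy catalogue the user liked only the Drama|Western title, so the pure Drama movie ranks first, the Comedy|Drama movie second, and the Comedy|Romance movie last.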
Hybrid
Weighted score fusion of both models:
hybrid_score(m) = α · minmax(SVD_score) + (1−α) · minmax(content_score)
Each signal is min-max normalised to [0, 1] before blending so they are on the same scale. α=0.5 gives equal weight to both; the effect of α is discussed in the results section.
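A small sketch of the blend, assuming both models have already scored the same candidate list in the same order. How `hybrid.py` aligns the two candidate pools is not shown here, and the function names are illustrative.

```python
import numpy as np


def minmax(x):
    """Scale scores to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def hybrid_scores(svd_scores, content_scores, alpha=0.5):
    """alpha * minmax(SVD) + (1 - alpha) * minmax(content), per candidate."""
    return alpha * minmax(svd_scores) + (1 - alpha) * minmax(content_scores)


svd = [4.5, 3.0, 4.0]      # predicted ratings for three candidates
content = [0.1, 0.9, 0.5]  # cosine similarities for the same candidates
blended = hybrid_scores(svd, content, alpha=0.5)
best = int(np.argmax(blended))
```

Here the third candidate wins the blend although it is second in both individual rankings, which illustrates how fusion can favour items that both signals rate reasonably well.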
| Property | Value |
|---|---|
| Dataset | MovieLens 1M |
| Ratings | 1,000,209 |
| Users | 6,040 |
| Movies | 3,706 rated / 3,883 in catalogue |
| Rating scale | 1 – 5 stars (whole stars only) |
Train/test split
Random 80/20 split stratified by user: each user's ratings are split independently so every user appears in both train and test. This avoids cold-start users in evaluation and ensures metrics reflect real personalisation quality rather than popularity bias from users with many ratings.
No temporal split was applied — MovieLens 1M timestamps are not used. This is a known limitation (see below).
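The per-user split can be sketched as below. This is an illustration, not the code in `preprocessing.py`: the tuple layout, the seed, and the rule that every user with at least two ratings keeps at least one rating on each side are assumptions of this example.

```python
import random
from collections import defaultdict


def per_user_split(ratings, test_frac=0.2, seed=42):
    """Split each user's ratings independently into train and test.

    ratings: list of (user_id, movie_id, rating) tuples (assumed layout).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows = rows[:]  # copy so the caller's data is not shuffled in place
        rng.shuffle(rows)
        if len(rows) < 2:
            n_test = 0  # a single rating cannot be split, so it stays in train
        else:
            n_test = min(max(1, round(len(rows) * test_frac)), len(rows) - 1)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


ratings = (
    [(1, m, 4.0) for m in range(10)]
    + [(2, 0, 5.0)]
    + [(3, m, 3.0) for m in range(5)]
)
train, test = per_user_split(ratings)
```

With this rule, every user who appears in the test set also has training ratings, so no evaluated user is cold-start for the models.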
Leakage prevention
- Models are trained exclusively on `train_df`.
- `test_df` is only accessed during metric computation.
- Recommendations always exclude movies the user has already rated (`exclude_rated=True`).
Cold-start
Users with a single rating are placed entirely in train (cannot be split 80/20). They are excluded from ranking evaluation but included in coverage calculation.
Relevance definition
A movie in the test set is considered relevant if the user rated it ≥ 3.5 stars. Using 4.0 (a common default) discards ~40% of valid positive examples in this dataset.
Negative sampling
No explicit negative sampling is performed. The full set of unseen movies is the implicit negative pool. Precision@K and Recall@K are computed against the test-set positives only — unseen movies that were never rated are treated as unknown, not as negatives. This is standard for implicit-feedback-style offline evaluation. Metrics are therefore only comparable within implicit-feedback evaluation settings and should not be interpreted against explicit-rating prediction benchmarks.
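One way to compute the three ranking metrics for a single user is sketched below. It assumes binary relevance for NDCG and a fixed denominator of K for precision; `evaluation.py` may define either differently, and the function name is illustrative.

```python
import math


def precision_recall_ndcg_at_k(recommended, test_ratings, k=10, threshold=3.5):
    """Per-user Precision@K, Recall@K and binary-relevance NDCG@K.

    recommended: ranked list of movie ids.
    test_ratings: dict mapping movie id to the user's held-out rating.
    """
    relevant = {m for m, r in test_ratings.items() if r >= threshold}
    hits = [1 if m in relevant else 0 for m in recommended[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # Discounted gain: a hit at 0-based position p contributes 1 / log2(p + 2).
    dcg = sum(h / math.log2(pos + 2) for pos, h in enumerate(hits))
    ideal = sum(1 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return precision, recall, ndcg


recommended = ["a", "b", "c", "d"]
test_ratings = {"a": 5.0, "b": 2.0, "c": 4.0, "x": 4.0}
precision, recall, ndcg = precision_recall_ndcg_at_k(recommended, test_ratings, k=4)
```

In the example, two of the four recommendations are test-set positives, giving Precision@4 = 0.5 and Recall@4 = 2/3. Movie "d" has no test rating and simply produces no hit.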
Candidate pool
The Hybrid model generates up to 500 candidates per model before blending. The SVD and Content models each rank all unseen movies that pass the min_ratings filter.
Coverage
Computed over all 6,040 eligible users, not just the evaluation sample, so it does not grow artificially with n_eval_users. It represents the fraction of the catalogue that the model ever recommends to anyone. This metric captures global item diversity but does not account for how evenly items are distributed across users.
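As a sketch, coverage reduces to counting the distinct movies that appear in any user's list. The dictionary layout and function name below are assumptions for illustration.

```python
def catalogue_coverage(recommendations_by_user, catalogue_size):
    """Fraction of the catalogue that appears in at least one user's list."""
    recommended = set()
    for items in recommendations_by_user.values():
        recommended.update(items)
    return len(recommended) / catalogue_size


recs = {1: ["a", "b"], 2: ["b", "c"], 3: ["a", "b"]}
coverage = catalogue_coverage(recs, catalogue_size=10)
```

For a sense of scale: if every user received the same 10 titles out of the 3,883-movie catalogue, coverage would be 10 / 3,883 ≈ 0.003.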
Evaluated over 4,500 sampled users, K=10, relevance threshold=3.5.
| Model | RMSE | Precision@10 | Recall@10 | NDCG@10 | Coverage |
|---|---|---|---|---|---|
| Popularity baseline | — | ~0.05* | ~0.02* | ~0.06* | 0.003 |
| SVD (k=100) | 0.903 | 0.092 | 0.046 | 0.109 | 0.098 |
| Hybrid (α=0.5) | — | 0.094 | 0.065 | 0.118 | 0.343 |
*Popularity baseline estimates — not exact, included for reference only.
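The Hybrid model improves on SVD alone in ranking quality (NDCG@10 0.118 vs 0.109, Recall@10 0.065 vs 0.046) and raises catalogue coverage from 0.098 to 0.343, a gain attributable to the content-based signal. These figures come from a single run, so small differences such as Precision@10 (0.094 vs 0.092) should be read with caution.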
Coverage (0.343) means the Hybrid recommends movies from 34.3% of the full 3,883-movie catalogue across all users. The SVD alone covers 9.8%. The difference comes from the content model introducing genre-driven diversity that the collaborative model misses.
Effect of α on NDCG@10 (Hybrid):
| α | NDCG@10 | Coverage |
|---|---|---|
| 1.0 (SVD only) | 0.109 | 0.098 |
| 0.7 | 0.113 | 0.180 |
| 0.5 | 0.118 | 0.343 |
| 0.3 | 0.111 | 0.401 |
| 0.0 (content only) | 0.062 | 0.451 |
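Of the values tested, α=0.5 gives the highest NDCG@10 (0.118) while keeping coverage high (0.343). Lowering α to 0.3 raises coverage to 0.401 but NDCG falls to 0.111, and the content-only model (α=0.0) drops to 0.062, as genre similarity alone is a weaker ranking signal than user history.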
Key improvements over the SVD baseline:
- −6.3% RMSE from shrinkage-regularised imputation
- +8% NDCG@10 from hybrid fusion
- +41% Recall@10 from larger candidate pool (100→500)
- 3.5× the catalogue coverage (0.098 → 0.343)
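The relative changes for NDCG@10, Recall@10 and coverage can be rechecked from the results table with a few lines of arithmetic. The RMSE figure is not covered because the table does not list the pre-shrinkage RMSE.

```python
# Values copied from the results table (SVD k=100 vs Hybrid alpha=0.5).
svd = {"recall": 0.046, "ndcg": 0.109, "coverage": 0.098}
hybrid = {"recall": 0.065, "ndcg": 0.118, "coverage": 0.343}


def relative_change(new, old):
    """Relative change of new with respect to old, as a fraction."""
    return (new - old) / old


ndcg_gain = relative_change(hybrid["ndcg"], svd["ndcg"])        # about 0.083
recall_gain = relative_change(hybrid["recall"], svd["recall"])  # about 0.413
coverage_ratio = hybrid["coverage"] / svd["coverage"]           # about 3.5
```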
Limitations
- No temporal dynamics. MovieLens 1M has timestamps but this project uses a random split. A temporal split (train on older ratings, test on newer) would better simulate real deployment and is known to reduce optimistic bias in offline metrics.
- Genre-only content features. The content model only uses pipe-separated genre tags. Movie descriptions, cast, directors, or embeddings from a language model would capture much richer semantic similarity.
- Cold-start not explicitly addressed. New users with no rating history cannot be served by the SVD model. The content model handles it partially (profile can be built from a single rated movie), but this is not evaluated.
- Offline evaluation gap. Precision@10 and Recall@10 only count hits against movies the user happened to rate in the test set. A movie the user would have loved but never watched counts as a miss — known as the missing-not-at-random problem. Real-world A/B testing would give a fairer picture.
- Popularity bias. Despite `min_ratings=20`, the SVD still favours well-known movies. The Hybrid partially compensates via the content signal.
- No significance testing. Metrics are reported from a single run. Confidence intervals across multiple random seeds would make comparisons more robust.
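For reference, the temporal split mentioned in the first limitation could look like the sketch below. It is not part of the project: it assumes rows of `(user_id, movie_id, rating, timestamp)` and a single global cut-off, whereas a per-user cut-off is an equally valid design.

```python
def temporal_split(ratings, train_frac=0.8):
    """Train on the oldest train_frac of ratings, test on the newest remainder.

    ratings: list of (user_id, movie_id, rating, timestamp) tuples (assumed layout).
    """
    ordered = sorted(ratings, key=lambda row: row[3])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


ratings = [
    (1, 10, 4.0, 500),
    (2, 11, 3.0, 100),
    (1, 12, 5.0, 300),
    (3, 10, 2.0, 400),
    (2, 13, 4.0, 200),
]
train, test = temporal_split(ratings)
```

Unlike the per-user random split, a global cut-off can leave some users with no training ratings at all, so cold-start handling would then need to be evaluated explicitly.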
├── data/
│   ├── ratings.csv
│   ├── movies.csv
│   ├── README_Dataset.md
│   └── users.csv
├── src/
│   ├── load_data.py                 # Data loading and validation
│   ├── preprocessing.py             # Matrix construction, imputation, train/test split
│   ├── evaluation.py                # RMSE, MAE, Precision@K, Recall@K, NDCG@K, Coverage
│   └── recommenders/
│       ├── collaborative.py         # SVD recommender
│       ├── content.py               # TF-IDF content-based recommender
│       └── hybrid.py                # Weighted hybrid
├── recommender_comparison.ipynb     # Interactive experiment tracker
├── run_history.json                 # Auto-generated run log
├── main.py                          # Full pipeline CLI
├── requirements.txt
└── README.md
git clone https://github.com/MartinLiarte/ML-ranking
cd ML-ranking
pip install -r requirements.txt

The MovieLens 1M dataset is included for reproducibility.
Requirements:
numpy
pandas
scipy
scikit-learn
matplotlib
jupyter
Full pipeline:
python main.py --ratings data/ratings.csv \
--movies data/movies.csv \
--users data/users.csv

Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--n_factors` | 100 | SVD latent factors |
| `--alpha` | 0.5 | Hybrid blend weight (1=SVD only, 0=content only) |
| `--min_ratings` | 20 | Min ratings for a movie to be recommended |
| `--user_id` | 1 | User to show recommendations for |
| `--top_n` | 10 | Number of recommendations |
| `--no_eval` | False | Skip evaluation (faster) |
Experiment tracking:
Open recommender_comparison.ipynb, edit the PARAMS block, and run all cells. Each run is saved to run_history.json and all charts update automatically to compare every experiment side by side.