MovieLens Ranking

A top-N recommender system built on MovieLens 1M (1M ratings, 6,040 users, 3,706 movies) that implements and compares three approaches — collaborative filtering, content-based filtering, and a hybrid fusion model — against a popularity baseline.

The goal is personalised movie ranking: given a user's rating history, return a ranked list of 10 unseen movies they are likely to enjoy. The system is evaluated end-to-end with standard retrieval metrics (Precision@10, Recall@10, NDCG@10, Coverage) over 4,500 users.

Pipeline

ratings.csv ──► preprocessing ──► train / test split (80/20, per-user)
                                        │
                    ┌───────────────────┤
                    │                   │
               SVD model          TF-IDF model
               (collab.)          (content)
                    │                   │
                    └────────┬──────────┘
                             │
                        Hybrid fusion
                      (weighted blend)
                             │
                        Ranking + Evaluation

Models

Popularity baseline
Recommends the globally most-rated movies to every user. No personalisation. Included as a sanity check — any meaningful recommender should beat it.
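
As a minimal sketch of this baseline, assuming ratings are held as plain `(user_id, movie_id, rating)` tuples: the function name `popularity_top_n` and the toy data are illustrative and are not taken from `src/`.

```python
from collections import Counter

def popularity_top_n(train_ratings, user_id, n=10):
    """Return the n most-rated movies that user_id has not rated yet.

    train_ratings: iterable of (user_id, movie_id, rating) tuples.
    """
    # Popularity here means "number of ratings", not average rating.
    counts = Counter(movie for _, movie, _ in train_ratings)
    seen = {movie for user, movie, _ in train_ratings if user == user_id}
    ranked = [movie for movie, _ in counts.most_common() if movie not in seen]
    return ranked[:n]

# Toy data: A has 3 ratings, B has 2, C and D have 1 each.
ratings = [
    (1, "A", 5), (2, "A", 4), (3, "A", 3),
    (1, "B", 4), (2, "B", 2),
    (3, "C", 5),
    (1, "D", 3),
]
top_for_user_3 = popularity_top_n(ratings, user_id=3, n=2)
print(top_for_user_3)
```

Every user receives the same global ordering, and only the removal of already-rated movies differs between users.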

SVD — Collaborative Filtering
Factorises the mean-centred user-item matrix using Truncated SVD (R ≈ U · Σ · Vᵀ). The k latent factors capture implicit taste dimensions without explicit labels. Missing values are imputed with a shrinkage-regularised baseline:

imputed(u, i) = μ + (μ_u − μ) + (μ_i_shrunk − μ)

where  μ_i_shrunk = (n_i · μ_i  +  λ · μ) / (n_i + λ)

With λ=25, a movie needs ~25 ratings before its own mean outweighs the global mean. This prevents obscure 5-star movies (1–2 ratings) from dominating recommendations. Movies with fewer than 20 training ratings are excluded from inference for the same reason — their matrix columns are almost entirely imputed signal.
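
The effect of the shrinkage term can be checked with a few lines of arithmetic. This is a sketch of the two formulas above only. The global mean of 3.6 and the helper names are assumptions chosen for illustration.

```python
def shrunk_item_mean(item_ratings, global_mean, lam=25.0):
    """mu_i_shrunk = (n_i * mu_i + lam * mu) / (n_i + lam)."""
    n = len(item_ratings)
    if n == 0:
        return global_mean  # no evidence: fall back to the global mean
    item_mean = sum(item_ratings) / n
    return (n * item_mean + lam * global_mean) / (n + lam)

def imputed_value(global_mean, user_mean, item_mean_shrunk):
    """imputed(u, i) = mu + (mu_u - mu) + (mu_i_shrunk - mu)."""
    return global_mean + (user_mean - global_mean) + (item_mean_shrunk - global_mean)

mu = 3.6  # hypothetical global mean, for illustration only

obscure = shrunk_item_mean([5.0, 5.0], mu)      # two 5-star ratings
balanced = shrunk_item_mean([5.0] * 25, mu)     # n_i == lam: halfway point
popular = shrunk_item_mean([4.5] * 500, mu)     # many ratings: barely shrunk

print(round(obscure, 3), round(balanced, 3), round(popular, 3))
print(round(imputed_value(mu, user_mean=4.0, item_mean_shrunk=obscure), 3))
```

With only two ratings the 5.0 average is pulled most of the way back to the global mean, while the movie with 500 ratings keeps almost all of its own mean.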

Content-Based Filtering
Represents each movie as a TF-IDF vector of its pipe-separated genre tags. Rare genres (Film-Noir, Western) receive higher weight than common ones (Drama, Comedy). A user taste profile is the rating-weighted mean of the vectors of movies they rated ≥ 3.5 stars. Candidates are ranked by cosine similarity to that profile.
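
The steps above can be sketched without any library dependency. The sketch assumes a plain unsmoothed IDF, `ln(N / df) + 1`, with L2-normalised vectors. The project's actual vectoriser settings are not stated here, and the movie IDs, genres, and function names below are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def genre_tfidf(movie_genres):
    """movie_genres: dict of movie_id -> pipe-separated genre string."""
    docs = {m: set(g.split("|")) for m, g in movie_genres.items()}
    n_docs = len(docs)
    df = Counter(g for genres in docs.values() for g in genres)
    # Assumed IDF variant: rarer genres get a larger weight.
    idf = {g: math.log(n_docs / c) + 1.0 for g, c in df.items()}
    vectors = {}
    for m, genres in docs.items():
        raw = {g: idf[g] for g in genres}  # each genre appears once, so tf = 1
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[m] = {g: w / norm for g, w in raw.items()}
    return vectors

def user_profile(user_ratings, vectors, min_rating=3.5):
    """Rating-weighted mean of the vectors of movies rated >= min_rating."""
    profile, total = defaultdict(float), 0.0
    for movie, rating in user_ratings.items():
        if rating >= min_rating and movie in vectors:
            for g, w in vectors[movie].items():
                profile[g] += rating * w
            total += rating
    return {g: w / total for g, w in profile.items()} if total else {}

def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

movies = {1: "Drama", 2: "Comedy|Drama", 3: "Drama|Film-Noir",
          4: "Comedy", 5: "Comedy|Drama", 6: "Film-Noir|Thriller"}
vectors = genre_tfidf(movies)

user_ratings = {1: 2.0, 3: 5.0}          # only movie 3 passes the 3.5 cut-off
profile = user_profile(user_ratings, vectors)
scores = {m: cosine(profile, v) for m, v in vectors.items() if m not in user_ratings}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy catalogue the Film-Noir title ranks first because the rare genre carries more weight than the shared Drama tag, and the pure Comedy ranks last with zero similarity.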

Hybrid
Weighted score fusion of both models:

hybrid_score(m) = α · minmax(SVD_score) + (1−α) · minmax(content_score)

Each signal is min-max normalised to [0, 1] before blending so they are on the same scale. α=0.5 gives equal weight to both; the effect of α is discussed in the results section.
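
A minimal sketch of the blend, assuming each model returns a `{movie_id: score}` dictionary for its candidates. Treating a movie that is missing from one model's candidates as 0 for that signal is an assumption of this sketch, not a documented behaviour of `hybrid.py`.

```python
def minmax(scores):
    """Rescale a {movie: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {m: 0.0 for m in scores}  # constant signal carries no ranking information
    return {m: (s - lo) / (hi - lo) for m, s in scores.items()}

def hybrid_scores(svd_scores, content_scores, alpha=0.5):
    """alpha * minmax(SVD) + (1 - alpha) * minmax(content)."""
    svd_n, content_n = minmax(svd_scores), minmax(content_scores)
    movies = set(svd_n) | set(content_n)
    # Assumption: a movie absent from one candidate list scores 0 on that signal.
    return {m: alpha * svd_n.get(m, 0.0) + (1 - alpha) * content_n.get(m, 0.0)
            for m in movies}

# Predicted ratings (roughly 1-5) and cosine similarities (0-1) live on different scales.
svd = {"A": 4.8, "B": 4.2, "C": 3.6}
content = {"A": 0.10, "B": 0.90, "D": 0.50}

blended = hybrid_scores(svd, content, alpha=0.5)
ranked = sorted(blended, key=blended.get, reverse=True)
print(ranked)
```

Without the normalisation step the predicted ratings would swamp the cosine similarities, so α would not behave as a true blend weight.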

Dataset & Preprocessing

Property	Value
Dataset	MovieLens 1M
Ratings	1,000,209
Users	6,040
Movies	3,706 rated / 3,883 in catalogue
Rating scale	1 – 5 stars (whole stars only)

Train/test split
Random 80/20 split stratified by user: each user's ratings are split independently so every user appears in both train and test. This avoids cold-start users in evaluation and ensures metrics reflect real personalisation quality rather than popularity bias from users with many ratings.
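
One way such a per-user split could look, using only the standard library. The rounding rule, the guarantee of at least one test rating for users with two or more ratings, and the name `per_user_split` are assumptions of this sketch. The version in `preprocessing.py` may differ.

```python
import random
from collections import defaultdict

def per_user_split(ratings, test_frac=0.2, seed=42):
    """Split each user's ratings independently into train and test.

    ratings: list of tuples whose first element is the user id.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        if len(rows) < 2:
            train.extend(rows)  # a single rating cannot be split
            continue
        rows = rows[:]
        rng.shuffle(rows)
        # Keep at least one rating on each side of the split.
        n_test = min(len(rows) - 1, max(1, round(len(rows) * test_frac)))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

ratings = ([(1, m, 4) for m in range(10)]      # user 1: 10 ratings
           + [(2, 100, 5)]                     # user 2: 1 rating
           + [(3, m, 3) for m in range(5)])    # user 3: 5 ratings
train, test = per_user_split(ratings)
print(len(train), len(test))
```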

No temporal split was applied — MovieLens 1M timestamps are not used. This is a known limitation (see below).

Leakage prevention

Models are trained exclusively on train_df
test_df is only accessed during metric computation
Recommendations always exclude movies the user has already rated (exclude_rated=True)

Cold-start
Users with a single rating are placed entirely in train (cannot be split 80/20). They are excluded from ranking evaluation but included in coverage calculation.

Evaluation

Relevance definition
A movie in the test set is considered relevant if the user rated it ≥ 3.5 stars. Using 4.0 (a common default) discards ~40% of valid positive examples in this dataset.

Negative sampling
No explicit negative sampling is performed: each model ranks the full set of unseen movies. Precision@K and Recall@K count a recommendation as a hit only if it is a test-set positive. The true relevance of a recommended movie the user never rated is unknown, but with no label available it still scores as a miss. This is standard for implicit-feedback-style offline evaluation. Metrics are therefore comparable only with results from the same style of evaluation and should not be read against explicit rating-prediction benchmarks.
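
For one user, the three ranking metrics can be sketched as follows. The sketch assumes binary gains for NDCG and divides precision by K even when fewer than K items are returned. Both choices are assumptions, and `evaluation.py` may handle them differently.

```python
import math

def precision_recall_ndcg_at_k(recommended, test_ratings, k=10, threshold=3.5):
    """Per-user Precision@K, Recall@K and binary-gain NDCG@K.

    recommended:  ranked list of movie ids (best first).
    test_ratings: dict of movie_id -> held-out rating for this user.
    """
    relevant = {m for m, r in test_ratings.items() if r >= threshold}
    hits = [1 if m in relevant else 0 for m in recommended[:k]]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0

    # DCG discounts each hit by log2(position + 1), positions starting at 1.
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg else 0.0
    return precision, recall, ndcg

recommended = ["A", "B", "C", "D"]
test_ratings = {"A": 5.0, "B": 2.0, "C": 4.0, "E": 4.5}  # D was never rated

precision, recall, ndcg = precision_recall_ndcg_at_k(recommended, test_ratings, k=4)
print(precision, round(recall, 3), round(ndcg, 3))
```

In this example D has no test rating and B is rated below the threshold. Both lower precision in exactly the same way.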

Candidate pool
The Hybrid model generates up to 500 candidates per model before blending. The SVD and Content models each rank all unseen movies that pass the min_ratings filter.

Coverage
Computed over all 6,040 eligible users, not just the evaluation sample, so it does not grow artificially with n_eval_users. It represents the fraction of the catalogue that the model ever recommends to anyone. This metric captures global item diversity but does not account for how evenly items are distributed across users.
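
A short sketch of the calculation, assuming recommendations are stored as a `{user_id: [movie_id, ...]}` dictionary (the function name and toy data are illustrative):

```python
def catalogue_coverage(recommendations_by_user, catalogue_size):
    """Fraction of the catalogue that appears in at least one user's list."""
    recommended = set()
    for recs in recommendations_by_user.values():
        recommended.update(recs)
    return len(recommended) / catalogue_size

recs = {1: ["A", "B"], 2: ["A", "C"], 3: ["A", "B"]}
coverage = catalogue_coverage(recs, catalogue_size=10)
print(coverage)  # A, B and C are the only movies ever recommended

# Converting a coverage value back to a movie count for the
# 3,883-movie catalogue listed in the dataset table:
print(round(0.343 * 3883))
```

Movie A appears in every list and C in only one, yet both count once. This is why the metric says nothing about how evenly items are spread across users.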

Results

Evaluated over 4,500 sampled users, K=10, relevance threshold=3.5.

Model	RMSE	Precision@10	Recall@10	NDCG@10	Coverage
Popularity baseline	—	~0.05*	~0.02*	~0.06*	0.003
SVD (k=100)	0.903	0.092	0.046	0.109	0.098
Hybrid (α=0.5)	—	0.094	0.065	0.118	0.343

*Popularity baseline estimates — not exact, included for reference only.

SVD vs Hybrid performance comparison

The Hybrid model improves Recall@10 (0.046 → 0.065) and NDCG@10 (0.109 → 0.118) over SVD alone, while Precision@10 is essentially unchanged (0.092 → 0.094). Catalogue coverage rises from 0.098 to 0.343 because of the content-based signal.

Coverage (0.343) means the Hybrid recommends movies from 34.3% of the full 3,883-movie catalogue across all users. The SVD alone covers 9.8%. The difference comes from the content model introducing genre-driven diversity that the collaborative model misses.

Effect of α on NDCG@10 (Hybrid):

α	NDCG@10	Coverage
1.0 (SVD only)	0.109	0.098
0.7	0.113	0.180
0.5	0.118	0.343
0.3	0.111	0.401
0.0 (content only)	0.062	0.451

α=0.5 gives the highest NDCG@10 of the values tested while keeping coverage high. As α falls towards 0 the content model dominates and NDCG drops (0.062 at α=0.0), as genre similarity alone is a weaker ranking signal than the collaborative one.

Key improvements over the SVD baseline:

−6.3% RMSE from shrinkage-regularised imputation
+8% NDCG@10 from hybrid fusion
+41% Recall@10 from larger candidate pool (100→500)
3.5× more catalogue coverage
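
The relative figures for NDCG, Recall, and coverage follow directly from the results table, as this arithmetic check shows. The RMSE figure is not reproduced because the unregularised RMSE it is compared against is not reported here.

```python
svd = {"recall": 0.046, "ndcg": 0.109, "coverage": 0.098}
hybrid = {"recall": 0.065, "ndcg": 0.118, "coverage": 0.343}

def relative_change(new, old):
    """Fractional change from old to new, e.g. 0.08 for +8%."""
    return (new - old) / old

ndcg_gain = relative_change(hybrid["ndcg"], svd["ndcg"])        # about +8%
recall_gain = relative_change(hybrid["recall"], svd["recall"])  # about +41%
coverage_ratio = hybrid["coverage"] / svd["coverage"]           # about 3.5x

print(f"NDCG@10   {ndcg_gain:+.1%}")
print(f"Recall@10 {recall_gain:+.1%}")
print(f"Coverage  {coverage_ratio:.1f}x")
```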

Limitations

No temporal dynamics. MovieLens 1M has timestamps but this project uses a random split. A temporal split (train on older ratings, test on newer) would better simulate real deployment and is known to reduce optimistic bias in offline metrics.
Genre-only content features. The content model only uses pipe-separated genre tags. Movie descriptions, cast, directors, or embeddings from a language model would capture much richer semantic similarity.
Cold-start not explicitly addressed. New users with no rating history cannot be served by the SVD model. The content model handles it partially (profile can be built from a single rated movie), but this is not evaluated.
Offline evaluation gap. Precision@10 and Recall@10 only count hits against movies the user happened to rate in the test set. A movie the user would have loved but never watched counts as a miss — known as the missing-not-at-random problem. Real-world A/B testing would give a fairer picture.
Popularity bias. Despite min_ratings=20, the SVD still favours well-known movies. The Hybrid partially compensates via the content signal.
No significance testing. Metrics are reported from a single run. Confidence intervals across multiple random seeds would make comparisons more robust.
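
The temporal split mentioned in the first limitation could be sketched per user as below, holding out each user's most recent ratings. This is one possible variant (a single global cut-off date is another), it is not part of the current pipeline, and the tuple layout and function name are assumptions.

```python
from collections import defaultdict

def temporal_per_user_split(ratings, test_frac=0.2):
    """Hold out each user's most recent ratings as the test set.

    ratings: list of (user_id, movie_id, rating, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r[3])  # oldest first
        if len(rows) < 2:
            train.extend(rows)
            continue
        n_test = min(len(rows) - 1, max(1, round(len(rows) * test_frac)))
        train.extend(rows[:-n_test])   # older ratings
        test.extend(rows[-n_test:])    # newest ratings
    return train, test

ratings = [
    (1, "A", 4, 100), (1, "B", 5, 300), (1, "C", 3, 200),
    (1, "D", 4, 400), (1, "E", 2, 500),
    (2, "A", 5, 50),
]
train, test = temporal_per_user_split(ratings)
print(test)
```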

Project structure

├── data/
│   ├── ratings.csv
│   ├── movies.csv
│   ├── README_Dataset.md
│   └── users.csv
├── src/
│   ├── load_data.py           # Data loading and validation
│   ├── preprocessing.py       # Matrix construction, imputation, train/test split
│   ├── evaluation.py          # RMSE, MAE, Precision@K, Recall@K, NDCG@K, Coverage
│   └── recommenders/
│       ├── collaborative.py   # SVD recommender
│       ├── content.py         # TF-IDF content-based recommender
│       └── hybrid.py          # Weighted hybrid
├── recommender_comparison.ipynb   # Interactive experiment tracker
├── run_history.json               # Auto-generated run log
├── main.py                        # Full pipeline CLI
├── requirements.txt
└── README.md

Setup

git clone https://github.com/MartinLiarte/ML-ranking
cd ML-ranking
pip install -r requirements.txt

The MovieLens 1M dataset is included for reproducibility.

Requirements:

numpy
pandas
scipy
scikit-learn
matplotlib
jupyter

Usage

Full pipeline:

python main.py --ratings data/ratings.csv \
               --movies  data/movies.csv  \
               --users   data/users.csv

Key arguments:

Argument	Default	Description
`--n_factors`	100	SVD latent factors
`--alpha`	0.5	Hybrid blend weight (1=SVD only, 0=content only)
`--min_ratings`	20	Min ratings for a movie to be recommended
`--user_id`	1	User to show recommendations for
`--top_n`	10	Number of recommendations
`--no_eval`	False	Skip evaluation (faster)

Experiment tracking:
Open recommender_comparison.ipynb, edit the PARAMS block, and run all cells. Each run is saved to run_history.json and all charts update automatically to compare every experiment side by side.
