Video Link - https://youtu.be/2s7LOVB7iKo
Given a partially-observed playlist, predict which songs are missing from it. We frame this as a ranking problem: given a set of observed songs, return a ranked list of candidates and evaluate using HitRate, Recall, NDCG, and MRR at K.
Everything is driven by the Makefile. From a fresh clone:
make install # pip install all dependencies
make all # download data, clean, process, train all 6 models, summarize, build interactive viz
Running make all chains the full pipeline. To run individual stages:
| Command | What it does |
|---|---|
| make install | pip install -r requirements.txt |
| make data | Download spotify_dataset.csv via download_data.py (~1.18 GB, requires Kaggle creds) |
| make clean-data | Run data_cleaning.py → data/spotify_dataset_clean.csv |
| make process | Run data_processing.py → data/spotify/{playlists_train,playlists_test_seen,playlists_test_hidden,song_meta_no_duplicates}.csv |
| make explore | Run explore_data.py → regenerates the static PNGs in figures/ |
| make models | Run all six rec_*.py scripts → results/<tag>_metrics.json + results/<tag>_recs.csv per model |
| make summarize | Aggregate per-model JSONs into results/summary.csv and results/summary.md |
| make viz | Build the interactive Plotly chart at figures/results_interactive.html |
| make test | Run unit + smoke tests via pytest |
| make clean | Remove derived artifacts (keeps the raw 1.18 GB Kaggle download) |
Kaggle credentials. make data uses kagglehub, which expects your Kaggle API token at ~/.kaggle/kaggle.json (see Kaggle's API docs).
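Before running make data, it can help to confirm that the token file is in place. The check below is a minimal sketch, not part of the repository: has_kaggle_token is a hypothetical helper, and it assumes the standard kaggle.json layout with "username" and "key" fields.

```python
import json
from pathlib import Path


def has_kaggle_token(path=None):
    """Return True if `path` is a JSON file with non-empty 'username' and 'key'.

    Hypothetical helper for illustration; defaults to ~/.kaggle/kaggle.json.
    """
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return (
        isinstance(token, dict)
        and bool(token.get("username"))
        and bool(token.get("key"))
    )
```

Calling has_kaggle_token() with no argument checks the default location.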
Colab alternative. Upload main.ipynb and spotify_dataset.csv to Colab, set COLAB_MODE=True, and Runtime → Run all. The notebook is self-contained and reproduces the same results without the Makefile.
make test runs three test modules:
- tests/test_preprocessing.py: pure unit tests for normalize_text, masking logic, popularity ranking, and co-occurrence scoring. No external data needed.
- tests/test_preprocessing2.py: extracts target functions from preprocessing.ipynb via nbformat and tests their behaviour (CSV cleaning, malformed-row handling, evaluation metrics).
- tests/test_rec_modules.py: smoke tests that build a tiny synthetic dataset and run rec_pop.py and rec_cooc.py end-to-end, asserting that each writes valid metrics.
CI runs the same make install && make test on every push and PR — see .github/workflows/test.yml.
| File | Description |
|---|---|
| figures/results_interactive.html | Interactive Plotly bar chart of all model metrics — open in any browser |
| figures/playlist_size.png | Distribution of playlist sizes |
| figures/playlists_per_user.png | Number of playlists per user |
| figures/top_artists.png | Top artists by track count |
The interactive chart is regenerated by make viz from results/summary.csv. The static PNGs are regenerated by make explore from the cleaned dataset.
Source: Spotify Playlists — Kaggle
Raw size: 617K rows of (user_id, artistname, trackname, playlistname) tuples scraped from Spotify's public playlist API.
After filtering: ~337K rows / 86.6K songs / 9,296 playlists.
- Download: download_data.py fetches the raw CSV via kagglehub.
- Clean: data_cleaning.py parses the malformed CSV (unescaped quotes inside fields), strips whitespace, drops nulls, and writes data/spotify_dataset_clean.csv. The "load lenient" routine bypasses pandas' CSV escaping by splitting on the literal `","` separator and recovers ~280K rows pandas would otherwise drop.
- Process: data_processing.py filters out songs that appear in fewer than MIN_SONG_FREQ=3 playlists (typo / parsing artifacts with no co-occurrence signal), holds out TEST_PLAYLIST_FRAC=0.2 of eligible playlists for evaluation, and within each test playlist masks HIDDEN_SONG_FRAC=0.2 of its tracks. It writes train/test_seen/test_hidden CSVs and a song_id ↔ (artist, track) lookup. RANDOM_SEED=42 makes the split reproducible.
- Explore: explore_data.py renders cardinality summaries and the three PNG distributions in figures/.
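The Process step can be sketched as follows. This is an illustration under stated assumptions rather than the code in data_processing.py: split_playlists is a hypothetical function, playlists are held in a plain dict instead of CSVs, and "eligible" is assumed to mean "has at least two songs after filtering" so that one song can stay visible and one can be hidden.

```python
import random
from collections import Counter

MIN_SONG_FREQ = 3
TEST_PLAYLIST_FRAC = 0.2
HIDDEN_SONG_FRAC = 0.2
RANDOM_SEED = 42


def split_playlists(playlists):
    """Split {playlist_id: [song_id, ...]} into train, test_seen, test_hidden."""
    # Count how many playlists each song appears in (duplicates count once).
    freq = Counter(s for songs in playlists.values() for s in set(songs))
    # Drop rare songs and de-duplicate while preserving order.
    filtered = {
        pid: [s for s in dict.fromkeys(songs) if freq[s] >= MIN_SONG_FREQ]
        for pid, songs in playlists.items()
    }
    # Assumption: a playlist is eligible for testing if it keeps >= 2 songs.
    eligible = sorted(pid for pid, songs in filtered.items() if len(songs) >= 2)
    rng = random.Random(RANDOM_SEED)  # fixed seed makes the split reproducible
    test_ids = set(rng.sample(eligible, int(len(eligible) * TEST_PLAYLIST_FRAC)))

    train, test_seen, test_hidden = {}, {}, {}
    for pid, songs in filtered.items():
        if pid not in test_ids:
            if songs:
                train[pid] = songs
            continue
        # Hide about HIDDEN_SONG_FRAC of the tracks, but always at least one.
        n_hidden = max(1, round(len(songs) * HIDDEN_SONG_FRAC))
        hidden = set(rng.sample(songs, n_hidden))
        test_seen[pid] = [s for s in songs if s not in hidden]
        test_hidden[pid] = [s for s in songs if s in hidden]
    return train, test_seen, test_hidden
```

Seeding a local random.Random instance, rather than the global generator, keeps the split stable even if other code draws random numbers first.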
| # | Model | Source | Idea |
|---|---|---|---|
| 1 | Popularity | rec_pop.py | Recommend globally most-played songs to every user. Null model — any useful recommender must beat it. |
| 2 | Co-occurrence | rec_cooc.py | For each seed song, aggregate songs that co-appear in training playlists. Score = raw count. |
| 3 | BM25 Co-occurrence | rec_bm25.py | BM25 (k1=1.5, b=0.75) weighting over the co-occurrence matrix: penalises seeds from very large playlists, rewards seeds supported by many small independent playlists. |
| 4 | ALS | rec_als.py | Alternating Least Squares with SVD warm-start, 32 latent factors, 10 iterations. Pure NumPy + SciPy — no compiled extensions. |
| 5 | KNN | rec_knn.py | K=500 cosine-similar training playlists, softmax-sharpened neighbour weights, IDF-style query weighting, mild popularity penalty. |
| 6 | KNN (advanced) | rec_knn_advanced.py | Variant with extra neighbour reweighting. |
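To make the difference between models 2 and 3 concrete, here is a minimal sketch of raw co-occurrence scoring next to one plausible BM25 weighting. It is not the code in rec_cooc.py or rec_bm25.py. The function names are hypothetical, and the BM25 variant assumes each training playlist acts as a "document" with binary term frequency, so the length-normalisation term depends only on playlist size.

```python
import math
from collections import Counter, defaultdict


def cooc_scores(train, seeds):
    """Raw count: +1 for every (seed, candidate) pair sharing a training playlist."""
    seed_set = set(seeds)
    scores = Counter()
    for playlist in train:
        songs = set(playlist)
        n_shared = len(songs & seed_set)
        if n_shared:
            for cand in songs - seed_set:
                scores[cand] += n_shared
    return scores


def bm25_scores(train, seeds, k1=1.5, b=0.75):
    """BM25-style weighting: short playlists and rarer seeds contribute more."""
    n = len(train)
    avg_len = sum(len(set(p)) for p in train) / n
    df = Counter(s for p in train for s in set(p))  # playlists containing each song
    seed_set = set(seeds)
    scores = defaultdict(float)
    for playlist in train:
        songs = set(playlist)
        shared = songs & seed_set
        if not shared:
            continue
        # BM25 saturation term with tf = 1, normalised by playlist length.
        length_norm = (k1 + 1) / (1 + k1 * (1 - b + b * len(songs) / avg_len))
        # Smoothed IDF, always positive.
        idf = sum(math.log(1 + (n - df[s] + 0.5) / (df[s] + 0.5)) for s in shared)
        for cand in songs - seed_set:
            scores[cand] += idf * length_norm
    return scores
```

In both cases, sorting candidates by descending score yields the ranked list. With raw counts, a candidate seen once in a 3-song playlist ties with one seen once in a 6-song playlist; the BM25 variant ranks the former higher.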
| Metric | Description |
|---|---|
| HitRate@K | Fraction of test cases with ≥1 hidden song in top-K |
| Recall@K | Average fraction of hidden songs recovered in top-K |
| NDCG@K | Normalised Discounted Cumulative Gain — rewards hits ranked higher |
| MRR@K | Mean Reciprocal Rank of the first correct hit |
| R-Precision | Hits in the top-R predictions, where R = number of hidden songs for the playlist |
Evaluated at K = 10, 20, 40 in the standalone scripts; K = 5, 10, 20, 50 in main.ipynb. Test set: 20% of playlists with ~20% of each playlist's songs hidden.
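For reference, the four @K metrics can be computed per test playlist as in the sketch below and then averaged over all test playlists. The function names are illustrative rather than taken from the repository, relevance is assumed binary (a song is either hidden or not), and the ranked list is assumed to contain no duplicates.

```python
import math


def hit_rate_at_k(ranked, hidden, k):
    """1.0 if at least one hidden song appears in the top-k, else 0.0."""
    return 1.0 if any(s in hidden for s in ranked[:k]) else 0.0


def recall_at_k(ranked, hidden, k):
    """Fraction of hidden songs that appear in the top-k."""
    if not hidden:
        return 0.0
    return sum(s in hidden for s in ranked[:k]) / len(hidden)


def ndcg_at_k(ranked, hidden, k):
    """DCG with binary gains, divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in hidden)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(hidden))))
    return dcg / ideal if ideal else 0.0


def mrr_at_k(ranked, hidden, k):
    """Reciprocal rank of the first hidden song in the top-k, or 0.0 if none."""
    for i, s in enumerate(ranked[:k]):
        if s in hidden:
            return 1.0 / (i + 1)
    return 0.0
```

For example, with ranked = ["x", "a", "y", "b"] and hidden = {"a", "b", "c"} at K = 4, the hit rate is 1.0, recall is 2/3, and the reciprocal rank is 1/2 because the first hit is at position 2.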
| Model | HitRate@10 | Recall@10 | NDCG@10 | MRR@10 | Coverage |
|---|---|---|---|---|---|
| Popularity Baseline | 0.028 | 0.008 | 0.007 | 0.014 | 0.09% |
| Co-occurrence | 0.345 | 0.174 | 0.172 | 0.229 | 8.1% |
| BM25 Co-occurrence | 0.421 | 0.206 | 0.206 | 0.276 | 10.9% |
| ALS (factors=32) | 0.099 | 0.028 | 0.029 | 0.054 | 3.9% |
| KNN (K=500) | 0.314 | 0.172 | 0.178 | 0.225 | 6.3% |
BM25 is the best-performing model on every reported metric. Its HitRate@10 is 15× that of the popularity baseline (0.421 vs 0.028) and 22% higher, in relative terms, than raw co-occurrence (0.345).
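The headline ratios follow directly from the HitRate@10 column of the table. A quick arithmetic check, with the values copied from the table and illustrative variable names:

```python
# HitRate@10 values copied from the results table.
hit_rate_at_10 = {"popularity": 0.028, "cooc": 0.345, "bm25": 0.421}

# Multiplicative improvement over the popularity baseline.
bm25_vs_pop = hit_rate_at_10["bm25"] / hit_rate_at_10["popularity"]
cooc_vs_pop = hit_rate_at_10["cooc"] / hit_rate_at_10["popularity"]

# Relative gain of BM25 over raw co-occurrence, in percent.
bm25_gain_pct = (hit_rate_at_10["bm25"] / hit_rate_at_10["cooc"] - 1) * 100

print(f"BM25 vs popularity: {bm25_vs_pop:.1f}x")
print(f"Co-occurrence vs popularity: {cooc_vs_pop:.1f}x")
print(f"BM25 vs co-occurrence: +{bm25_gain_pct:.0f}%")
```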
make summarize regenerates this table at results/summary.md from the per-model JSON metrics.
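The aggregation step can be pictured roughly as below. This is a sketch and not the contents of summarize_results.py: the summarize function is hypothetical, and it assumes each results/<tag>_metrics.json file holds a flat JSON object mapping metric names to numbers.

```python
import csv
import json
from pathlib import Path


def summarize(results_dir, out_csv):
    """Collect every <tag>_metrics.json in results_dir into one CSV, one row per model."""
    suffix = "_metrics.json"
    rows = []
    for path in sorted(Path(results_dir).glob(f"*{suffix}")):
        tag = path.name[: -len(suffix)]
        metrics = json.loads(path.read_text())  # assumed flat: {"HitRate@10": 0.42, ...}
        rows.append({"model": tag, **metrics})
    # Union of metric names, so models with different metrics still line up.
    fields = ["model"] + sorted({k for row in rows for k in row if k != "model"})
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)  # missing metrics are left blank
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as summarize("results", "results/summary.csv") would write one row per model tag found in the directory.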
| Hypothesis | Verdict |
|---|---|
| H1: Popularity is a weak baseline | Confirmed: 2.8% HitRate@10 vs 42.1% for BM25 |
| H2: Co-occurrence captures genre context | Confirmed: 12× HitRate@10 gain over popularity (34.5% vs 2.8%) |
| H3: BM25 length-normalisation helps | Confirmed: +22% relative HitRate@10 gain over co-occurrence |
| H4: ALS underfits on sparse data | Confirmed: 9.9% HitRate@10, below co-occurrence (34.5%) |
| H5: KNN competitive on sparse data | Partially confirmed: beats ALS on every metric but trails co-occurrence on HitRate@10 (31.4% vs 34.5%) at this scale |
CS506-project/
├── Makefile # Build + pipeline targets
├── requirements.txt # Pinned Python dependencies
├── download_data.py # Kaggle download via kagglehub
├── data_cleaning.py # Lenient CSV parse → cleaned dataset
├── data_processing.py # Train/test split + masking
├── explore_data.py # EDA + static PNGs
├── rec_pop.py # Popularity baseline
├── rec_cooc.py # Co-occurrence
├── rec_bm25.py # BM25 co-occurrence (best model)
├── rec_als.py # ALS matrix factorisation
├── rec_knn.py # KNN playlist similarity
├── rec_knn_advanced.py # KNN variant
├── summarize_results.py # Aggregate per-model metrics
├── make_interactive_viz.py # Build interactive Plotly chart
├── recommendation.py # Shared recommendation utilities
├── main.ipynb # Full pipeline as a notebook (Colab-ready)
├── preprocessing.ipynb # Data cleaning + baseline models notebook
├── tests/ # pytest suite (unit + smoke tests)
├── scripts/ # SCC cluster submission scripts
├── figures/ # Visualizations (interactive HTML + PNGs)
├── results/ # Model outputs (created by `make models`)
├── data/ # Raw and processed data (gitignored)
└── report.md # Full project report
See requirements.txt. The project code is pure Python with no compiled extensions of its own, and it works on Python 3.11 through 3.14.