CS506 Final Project - Spotify Playlist Completion

Video Link - https://youtu.be/2s7LOVB7iKo

Given a partially-observed playlist, predict which songs are missing from it. We frame this as a ranking problem: given a set of observed songs, return a ranked list of candidates and evaluate using HitRate, Recall, NDCG, and MRR at K.

How to build and run

Everything is driven by the Makefile. From a fresh clone:

make install   # pip install all dependencies
make all       # download data, clean, process, train all 6 models, summarize, build interactive viz

make all chains the full pipeline. To run individual stages:

Command	What it does
`make install`	`pip install -r requirements.txt`
`make data`	Download `spotify_dataset.csv` via download_data.py (~1.18 GB, requires Kaggle creds)
`make clean-data`	Run data_cleaning.py → `data/spotify_dataset_clean.csv`
`make process`	Run data_processing.py → `data/spotify/{playlists_train,playlists_test_seen,playlists_test_hidden,song_meta_no_duplicates}.csv`
`make explore`	Run explore_data.py → regenerates the static PNGs in figures/
`make models`	Run all six `rec_*.py` scripts → `results/<tag>_metrics.json` + `results/<tag>_recs.csv` per model
`make summarize`	Aggregate per-model JSONs into `results/summary.csv` and `results/summary.md`
`make viz`	Build the interactive Plotly chart at `figures/results_interactive.html`
`make test`	Run unit + smoke tests via pytest
`make clean`	Remove derived artifacts (keeps the raw 1.18 GB Kaggle download)

Kaggle credentials. make data uses kagglehub, which expects your Kaggle API token at ~/.kaggle/kaggle.json (see Kaggle's API docs).

Colab alternative. Upload main.ipynb and spotify_dataset.csv to Colab, set COLAB_MODE=True, and Runtime → Run all. The notebook is self-contained and reproduces the same results without the Makefile.

Tests

make test

Runs three test modules:

tests/test_preprocessing.py — pure unit tests for normalize_text, masking logic, popularity ranking, and co-occurrence scoring. No external data needed.
tests/test_preprocessing2.py — extracts target functions from preprocessing.ipynb via nbformat and tests their behaviour (CSV cleaning, malformed-row handling, evaluation metrics).
tests/test_rec_modules.py — smoke tests that build a tiny synthetic dataset and run rec_pop.py and rec_cooc.py end-to-end, asserting that each writes valid metrics.

CI runs the same make install && make test on every push and PR — see .github/workflows/test.yml.

Visualizations

File	Type
figures/results_interactive.html	Interactive Plotly bar chart of all model metrics — open in any browser
figures/playlist_size.png	Distribution of playlist sizes
figures/playlists_per_user.png	Number of playlists per user
figures/top_artists.png	Top artists by track count

The interactive chart is regenerated by make viz from results/summary.csv. The static PNGs are regenerated by make explore from the cleaned dataset.

Data processing & modeling

Dataset

Source: Spotify Playlists (Kaggle)

Raw size: 617K rows of (user_id, artistname, trackname, playlistname) tuples scraped from Spotify's public playlist API.

After filtering: ~337K rows / 86.6K songs / 9,296 playlists.

Pipeline

Download — download_data.py fetches the raw CSV via kagglehub.
Clean — data_cleaning.py parses the malformed CSV (unescaped quotes inside fields), strips whitespace, drops nulls, and writes data/spotify_dataset_clean.csv. The "load lenient" routine bypasses pandas' CSV escaping by splitting on the literal "," separator and recovers ~280K rows pandas would otherwise drop.
Process — data_processing.py filters out songs that appear in fewer than MIN_SONG_FREQ=3 playlists (typo / parsing artifacts with no co-occurrence signal), holds out TEST_PLAYLIST_FRAC=0.2 of eligible playlists for evaluation, and within each test playlist masks HIDDEN_SONG_FRAC=0.2 of its tracks. Writes train/test_seen/test_hidden CSVs and a song_id ↔ (artist, track) lookup. RANDOM_SEED=42 makes the split reproducible.
Explore — explore_data.py renders cardinality summaries and the three PNG distributions in figures/.
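
The Clean and Process steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in data_cleaning.py or data_processing.py: the function names, the in-memory `{playlist_id: [song_id, ...]}` layout, and the `min_len` eligibility threshold are all assumptions made for the example, as is the premise that every raw field is wrapped in double quotes.

```python
import random
from collections import Counter


def parse_lenient_line(line):
    """Split one raw line on the literal '","' separator (hypothetical helper).

    Assumes all four fields are double-quoted. Returns the stripped fields,
    or None when the line does not yield exactly four non-empty values.
    """
    line = line.strip()
    if len(line) < 2 or not (line.startswith('"') and line.endswith('"')):
        return None
    fields = [f.strip() for f in line[1:-1].split('","')]
    if len(fields) != 4 or any(f == "" for f in fields):
        return None
    return fields


def filter_rare_songs(playlists, min_song_freq=3):
    """Drop songs that appear in fewer than min_song_freq playlists."""
    freq = Counter(s for songs in playlists.values() for s in set(songs))
    return {pid: [s for s in songs if freq[s] >= min_song_freq]
            for pid, songs in playlists.items()}


def split_and_mask(playlists, test_frac=0.2, hidden_frac=0.2, seed=42, min_len=5):
    """Hold out test playlists and hide a fraction of each one's songs.

    min_len is an assumed eligibility rule; the real criterion may differ.
    """
    rng = random.Random(seed)
    eligible = sorted(pid for pid, s in playlists.items() if len(s) >= min_len)
    rng.shuffle(eligible)
    test_ids = set(eligible[:int(len(eligible) * test_frac)])

    train, seen, hidden = {}, {}, {}
    for pid, songs in playlists.items():
        if pid not in test_ids:
            train[pid] = list(songs)
            continue
        n_hidden = max(1, round(len(songs) * hidden_frac))
        hidden_idx = set(rng.sample(range(len(songs)), n_hidden))
        seen[pid] = [s for i, s in enumerate(songs) if i not in hidden_idx]
        hidden[pid] = [s for i, s in enumerate(songs) if i in hidden_idx]
    return train, seen, hidden
```

Because the split uses its own `random.Random(seed)` instance, repeated runs with the same seed and input order produce the same train/test partition, which is the property RANDOM_SEED=42 is meant to guarantee.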

Models

#	Model	Source	Idea
1	Popularity	rec_pop.py	Recommend globally most-played songs to every user. Null model — any useful recommender must beat it.
2	Co-occurrence	rec_cooc.py	For each seed song, aggregate songs that co-appear in training playlists. Score = raw count.
3	BM25 Co-occurrence	rec_bm25.py	BM25 (k1=1.5, b=0.75) weighting over the co-occurrence matrix: penalises seeds from very large playlists, rewards seeds supported by many small independent playlists.
4	ALS	rec_als.py	Alternating Least Squares with SVD warm-start, 32 latent factors, 10 iterations. Pure NumPy + SciPy — no compiled extensions.
5	KNN	rec_knn.py	K=500 cosine-similar training playlists, softmax-sharpened neighbour weights, IDF-style query weighting, mild popularity penalty.
6	KNN (advanced)	rec_knn_advanced.py	Variant with extra neighbour reweighting.
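
As a rough illustration of model 2 in the table above, the sketch below builds raw co-occurrence counts from training playlists and ranks candidates by the summed count over all seed songs. It is not the code in rec_cooc.py; the function names and the dictionary-based data layout are assumptions chosen for readability, and ties are broken by song id only to keep the output deterministic.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooc(train_playlists):
    """Map each song to a Counter of songs that share a training playlist with it."""
    cooc = defaultdict(Counter)
    for songs in train_playlists.values():
        # Count each ordered pair once per playlist, ignoring duplicate tracks.
        for a, b in permutations(set(songs), 2):
            cooc[a][b] += 1
    return cooc


def recommend_cooc(seed_songs, cooc, k=10):
    """Rank candidates by their total raw co-occurrence count with the seeds."""
    seeds = set(seed_songs)
    scores = Counter()
    for seed in seeds:
        for candidate, count in cooc.get(seed, {}).items():
            if candidate not in seeds:  # never recommend an already-observed song
                scores[candidate] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:k]]


train = {"p1": ["a", "b", "c"], "p2": ["a", "b"], "p3": ["a", "d"]}
cooc = build_cooc(train)
print(recommend_cooc(["a"], cooc, k=2))  # ['b', 'c']
```

A seed that never appears in training yields an empty list here; a real pipeline would likely fall back to the popularity ranking in that case.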

Evaluation metrics

Metric	Description
HitRate@K	Fraction of test cases with ≥1 hidden song in top-K
Recall@K	Average fraction of hidden songs recovered in top-K
NDCG@K	Normalised Discounted Cumulative Gain — rewards hits ranked higher
MRR@K	Mean Reciprocal Rank of the first correct hit
R-Precision	Hits in the top-R predictions, where R = the number of hidden songs in the test playlist

Evaluated at K = 10, 20, 40 in the standalone scripts; K = 5, 10, 20, 50 in main.ipynb. Test set: 20% of playlists with ~20% of each playlist's songs hidden.
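
The four @K metrics can be computed per test playlist and then averaged. The sketch below uses the common binary-relevance definitions (a recommended song is either hidden or not) and assumes the ranked list contains no duplicates; the function names are illustrative and the repository's own implementation may differ in details such as how empty hidden sets are handled.

```python
import math


def metrics_at_k(ranked, hidden, k):
    """Return HitRate, Recall, NDCG and MRR at K for one test playlist."""
    hidden = set(hidden)
    hits = [1 if song in hidden else 0 for song in ranked[:k]]
    n_hits = sum(hits)

    # DCG discounts a hit at 0-based position i by log2(i + 2).
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    # Ideal DCG places every recoverable hidden song at the top of the list.
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(hidden), k)))
    first_hit_rank = next((i + 1 for i, h in enumerate(hits) if h), None)

    return {
        "hit_rate": 1.0 if n_hits else 0.0,
        "recall": n_hits / len(hidden) if hidden else 0.0,
        "ndcg": dcg / ideal if ideal else 0.0,
        "mrr": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


def mean_metrics(per_playlist):
    """Average a non-empty list of per-playlist metric dicts."""
    keys = per_playlist[0].keys()
    return {key: sum(m[key] for m in per_playlist) / len(per_playlist) for key in keys}
```

For example, with `ranked = ["a", "x", "b", "y"]`, `hidden = ["a", "b", "c"]` and `k = 3`, two of the three hidden songs are recovered, so Recall@3 is 2/3 and MRR@3 is 1.0 because the first recommendation is a hit.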

Results

Model	HitRate@10	Recall@10	NDCG@10	MRR@10	Coverage
Popularity Baseline	0.028	0.008	0.007	0.014	0.09%
Co-occurrence	0.345	0.174	0.172	0.229	8.1%
BM25 Co-occurrence	0.421	0.206	0.206	0.276	10.9%
ALS (factors=32)	0.099	0.028	0.029	0.054	3.9%
KNN (K=500)	0.314	0.172	0.178	0.225	6.3%

BM25 co-occurrence is the best-performing model on every accuracy metric in the table. Its HitRate@10 of 0.421 is a 15× improvement over the popularity baseline (0.028) and a 22% relative gain over raw co-occurrence (0.345).
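
The headline ratios follow directly from the HitRate@10 column of the table. A short check, using only values copied from the table (the variable names are just for illustration):

```python
# HitRate@10 values copied from the results table.
hit_rate_at_10 = {
    "popularity": 0.028,
    "cooc": 0.345,
    "bm25": 0.421,
}

bm25_vs_popularity = hit_rate_at_10["bm25"] / hit_rate_at_10["popularity"]
bm25_vs_cooc_gain = hit_rate_at_10["bm25"] / hit_rate_at_10["cooc"] - 1
cooc_vs_popularity = hit_rate_at_10["cooc"] / hit_rate_at_10["popularity"]

print(f"BM25 vs popularity: {bm25_vs_popularity:.1f}x")        # 15.0x
print(f"BM25 vs co-occurrence: +{bm25_vs_cooc_gain:.1%}")      # +22.0%
print(f"Co-occurrence vs popularity: {cooc_vs_popularity:.1f}x")  # 12.3x
```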

make summarize regenerates this table at results/summary.md from the per-model JSON metrics.
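
A possible shape for this aggregation step is sketched below. It is not the contents of summarize_results.py: the `summarize` function is hypothetical, and it assumes each `results/<tag>_metrics.json` file holds a flat mapping of metric name to number, which the actual files may not.

```python
import csv
import json
from pathlib import Path


def summarize(results_dir):
    """Collect <tag>_metrics.json files into summary.csv and summary.md."""
    results_dir = Path(results_dir)
    rows = []
    for path in sorted(results_dir.glob("*_metrics.json")):
        tag = path.name[: -len("_metrics.json")]
        metrics = json.loads(path.read_text())  # assumed: flat {metric: value}
        rows.append({"model": tag, **metrics})

    # Union of metric names, so models with different metric sets still fit.
    metric_names = sorted({key for row in rows for key in row if key != "model"})
    columns = ["model"] + metric_names

    with open(results_dir / "summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)

    lines = ["| " + " | ".join(columns) + " |", "|" + "---|" * len(columns)]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(c, "")) for c in columns) + " |")
    (results_dir / "summary.md").write_text("\n".join(lines) + "\n")
    return rows
```

Calling `summarize("results")` after `make models` would then write both summary files next to the per-model JSONs.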

Hypotheses & Verdicts

Hypothesis	Verdict
H1: Popularity is a weak baseline	Confirmed: 2.8% vs 42.1% HitRate@10 for the best model (BM25)
H2: Co-occurrence captures genre context	Confirmed: 12× HitRate@10 gain over popularity
H3: BM25 length-normalisation helps	Confirmed: +22% relative HitRate@10 gain over raw co-occurrence
H4: ALS underfits on sparse data	Confirmed: 9.9% HitRate@10, well below co-occurrence (34.5%)
H5: KNN competitive on sparse data	Partially confirmed: beats ALS (31.4% vs 9.9% HitRate@10) but trails raw co-occurrence on HitRate@10 (34.5%), while edging ahead on NDCG@10 (0.178 vs 0.172)

Repository structure

CS506-project/
├── Makefile                     # Build + pipeline targets
├── requirements.txt             # Pinned Python dependencies
├── download_data.py             # Kaggle download via kagglehub
├── data_cleaning.py             # Lenient CSV parse → cleaned dataset
├── data_processing.py           # Train/test split + masking
├── explore_data.py              # EDA + static PNGs
├── rec_pop.py                   # Popularity baseline
├── rec_cooc.py                  # Co-occurrence
├── rec_bm25.py                  # BM25 co-occurrence (best model)
├── rec_als.py                   # ALS matrix factorisation
├── rec_knn.py                   # KNN playlist similarity
├── rec_knn_advanced.py          # KNN variant
├── summarize_results.py         # Aggregate per-model metrics
├── make_interactive_viz.py      # Build interactive Plotly chart
├── recommendation.py            # Shared recommendation utilities
├── main.ipynb                   # Full pipeline as a notebook (Colab-ready)
├── preprocessing.ipynb          # Data cleaning + baseline models notebook
├── tests/                       # pytest suite (unit + smoke tests)
├── scripts/                     # SCC cluster submission scripts
├── figures/                     # Visualizations (interactive HTML + PNGs)
├── results/                     # Model outputs (created by `make models`)
├── data/                        # Raw and processed data (gitignored)
└── report.md                    # Full project report

Dependencies

See requirements.txt. All pure-Python - no compiled extensions, works on Python 3.11 through 3.14.
