Sanjiv01/CS506-project

CS506 Final Project - Spotify Playlist Completion

Video Link - https://youtu.be/2s7LOVB7iKo

Given a partially-observed playlist, predict which songs are missing from it. We frame this as a ranking problem: given a set of observed songs, return a ranked list of candidates and evaluate using HitRate, Recall, NDCG, and MRR at K.


How to build and run

Everything is driven by the Makefile. From a fresh clone:

make install   # pip install all dependencies
make all       # download data, clean, process, train all 6 models, summarize, build interactive viz

make all chains the full pipeline. To run individual stages:

| Command | What it does |
| --- | --- |
| make install | pip install -r requirements.txt |
| make data | Download spotify_dataset.csv via download_data.py (~1.18 GB, requires Kaggle creds) |
| make clean-data | Run data_cleaning.py → data/spotify_dataset_clean.csv |
| make process | Run data_processing.py → data/spotify/{playlists_train,playlists_test_seen,playlists_test_hidden,song_meta_no_duplicates}.csv |
| make explore | Run explore_data.py → regenerates the static PNGs in figures/ |
| make models | Run all six rec_*.py scripts → `results/<tag>_metrics.json` + `results/<tag>_recs.csv` per model |
| make summarize | Aggregate per-model JSONs into results/summary.csv and results/summary.md |
| make viz | Build the interactive Plotly chart at figures/results_interactive.html |
| make test | Run unit + smoke tests via pytest |
| make clean | Remove derived artifacts (keeps the raw 1.18 GB Kaggle download) |

Kaggle credentials. make data uses kagglehub, which expects your Kaggle API token at ~/.kaggle/kaggle.json (see Kaggle's API docs).
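
Before running make data, it can help to confirm the token file is where it is expected. The following is a minimal sketch, not part of the repository: the helper name find_kaggle_token is hypothetical, and it only checks that ~/.kaggle/kaggle.json exists and carries the username and key fields that a Kaggle API token file normally has.

```python
import json
from pathlib import Path
from typing import Optional


def find_kaggle_token(home: Optional[Path] = None) -> Optional[Path]:
    """Return the path to kaggle.json if it looks usable, else None.

    Hypothetical helper for illustration; it never contacts Kaggle.
    """
    home = Path.home() if home is None else Path(home)
    token = home / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return None
    try:
        creds = json.loads(token.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
    # A Kaggle API token file normally carries these two keys.
    if not isinstance(creds, dict) or not {"username", "key"} <= creds.keys():
        return None
    return token


if __name__ == "__main__":
    found = find_kaggle_token()
    print(f"Kaggle token found at {found}" if found else "No usable Kaggle token found")
```

If the check fails, download a fresh API token from your Kaggle account settings and place it at the path above.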

Colab alternative. Upload main.ipynb and spotify_dataset.csv to Colab, set COLAB_MODE=True, and Runtime → Run all. The notebook is self-contained and reproduces the same results without the Makefile.


Tests

make test

Runs the pytest suite in tests/, which consists of three test modules covering unit and smoke tests.

CI runs the same make install && make test on every push and PR — see .github/workflows/test.yml.


Visualizations

| File | Description |
| --- | --- |
| figures/results_interactive.html | Interactive Plotly bar chart of all model metrics (open in any browser) |
| figures/playlist_size.png | Distribution of playlist sizes |
| figures/playlists_per_user.png | Number of playlists per user |
| figures/top_artists.png | Top artists by track count |

The interactive chart is regenerated by make viz from results/summary.csv. The static PNGs are regenerated by make explore from the cleaned dataset.


Data processing & modeling

Dataset

Source: Spotify Playlists (Kaggle)

Raw size: 617K rows of (user_id, artistname, trackname, playlistname) tuples scraped from Spotify's public playlist API.

After filtering: ~337K rows / 86.6K songs / 9,296 playlists.

Pipeline

  1. Download: download_data.py fetches the raw CSV via kagglehub.
  2. Clean: data_cleaning.py parses the malformed CSV (unescaped quotes inside fields), strips whitespace, drops nulls, and writes data/spotify_dataset_clean.csv. The "load lenient" routine bypasses pandas' CSV escaping by splitting on the literal "," separator and recovers ~280K rows pandas would otherwise drop.
  3. Process: data_processing.py filters out songs that appear in fewer than MIN_SONG_FREQ=3 playlists (typos and parsing artifacts with no co-occurrence signal), holds out TEST_PLAYLIST_FRAC=0.2 of eligible playlists for evaluation, and within each test playlist masks HIDDEN_SONG_FRAC=0.2 of its tracks. It writes the train/test_seen/test_hidden CSVs and a song_id ↔ (artist, track) lookup. RANDOM_SEED=42 makes the split reproducible.
  4. Explore: explore_data.py renders cardinality summaries and the three PNG distributions in figures/.
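
The lenient parse in the Clean step can be pictured with a short sketch. This is an illustration under assumptions, not the code in data_cleaning.py: the function name split_lenient is hypothetical, and it assumes every field in the raw file is wrapped in double quotes so that the three-character sequence `","` only occurs between fields.

```python
from typing import List, Optional

EXPECTED_FIELDS = 4  # user_id, artistname, trackname, playlistname


def split_lenient(line: str, n_fields: int = EXPECTED_FIELDS) -> Optional[List[str]]:
    """Split one raw CSV line on the literal '","' separator.

    Returns None when the line does not yield exactly n_fields
    non-empty values, so the caller can drop it.
    """
    line = line.rstrip("\r\n")
    # Drop the outer quotes that wrap the first and last field.
    if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    # Quotes inside a field are left untouched; only '","' splits.
    fields = [field.strip() for field in line.split('","')]
    if len(fields) != n_fields or any(field == "" for field in fields):
        return None
    return fields


if __name__ == "__main__":
    raw = '"u1","Some Artist","My "Best" Song","Road Trip"\n'
    print(split_lenient(raw))
```

A strict CSV reader would treat the inner quotes around Best as malformed escaping, whereas the literal split keeps the row intact. The trade-off is that a field which itself contains `","` produces too many pieces and is rejected.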

Models

| # | Model | Source | Idea |
| --- | --- | --- | --- |
| 1 | Popularity | rec_pop.py | Recommend globally most-played songs to every user. Null model: any useful recommender must beat it. |
| 2 | Co-occurrence | rec_cooc.py | For each seed song, aggregate songs that co-appear in training playlists. Score = raw count. |
| 3 | BM25 Co-occurrence | rec_bm25.py | BM25 (k1=1.5, b=0.75) weighting over the co-occurrence matrix: penalises seeds from very large playlists, rewards seeds supported by many small independent playlists. |
| 4 | ALS | rec_als.py | Alternating Least Squares with SVD warm-start, 32 latent factors, 10 iterations. Written in plain NumPy + SciPy, with no custom compiled extensions. |
| 5 | KNN | rec_knn.py | K=500 cosine-similar training playlists, softmax-sharpened neighbour weights, IDF-style query weighting, mild popularity penalty. |
| 6 | KNN (advanced) | rec_knn_advanced.py | Variant with extra neighbour reweighting. |
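
To make the difference between models 2 and 3 concrete, here is a small sketch of co-occurrence scoring with an optional BM25-style length normalisation. It is one plausible reading of the descriptions above, not the code in rec_cooc.py or rec_bm25.py: the function names are hypothetical, the BM25 term uses tf = 1 because a song is either in a playlist or not, and the IDF factor of full BM25 is left out for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set


def build_index(playlists: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each song to the ids of the training playlists containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for pid, songs in playlists.items():
        for song in songs:
            index[song].add(pid)
    return index


def recommend(
    seeds: Iterable[str],
    playlists: Dict[str, Set[str]],
    index: Dict[str, Set[str]],
    k: int = 10,
    k1: Optional[float] = None,
    b: float = 0.75,
) -> List[str]:
    """Rank candidate songs by co-occurrence with the seed songs.

    k1=None scores by raw count. A numeric k1 applies a BM25-style
    weight so that large playlists contribute less per co-occurrence.
    """
    seed_set = set(seeds)
    avg_len = sum(len(s) for s in playlists.values()) / max(len(playlists), 1)
    scores: Counter = Counter()
    for seed in seed_set:
        for pid in index.get(seed, ()):
            songs = playlists[pid]
            if k1 is None:
                weight = 1.0
            else:
                # BM25 term weight with tf = 1 and length normalisation.
                norm = 1.0 - b + b * len(songs) / avg_len
                weight = (k1 + 1.0) / (1.0 + k1 * norm)
            for cand in songs:
                if cand not in seed_set:
                    scores[cand] += weight
    # Sort by score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:k]]


if __name__ == "__main__":
    toy = {
        "p1": {"seed", "small_hit"},
        "p2": {"seed", "big_hit"} | {f"z{i}" for i in range(10)},
    }
    idx = build_index(toy)
    print(recommend({"seed"}, toy, idx, k=1))          # raw count
    print(recommend({"seed"}, toy, idx, k=1, k1=1.5))  # BM25-style
```

In the toy data both candidates co-occur with the seed exactly once, so raw counts cannot separate them. The length-normalised weight favours the candidate that shares a two-song playlist with the seed over the one buried in a twelve-song playlist.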

Evaluation metrics

| Metric | Description |
| --- | --- |
| HitRate@K | Fraction of test cases with ≥1 hidden song in top-K |
| Recall@K | Average fraction of hidden songs recovered in top-K |
| NDCG@K | Normalised Discounted Cumulative Gain: rewards hits ranked higher |
| MRR@K | Mean Reciprocal Rank of the first correct hit |
| R-Precision | Hits in the top-R predictions, where R = number of hidden songs in the playlist |
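
The @K metrics in the table can be sketched for a single test playlist as follows. This is a minimal illustration assuming binary relevance (a recommended song is either hidden or not) and a ranked list without duplicates; the function names are hypothetical and the repository's own evaluation code may differ in detail.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def metrics_at_k(ranked: Sequence[str], hidden: Set[str], k: int) -> Dict[str, float]:
    """Compute hit, recall, NDCG and reciprocal rank at K for one playlist."""
    top_k = list(ranked[:k])
    # 1-based ranks at which a hidden song appears.
    hit_ranks = [i for i, song in enumerate(top_k, start=1) if song in hidden]
    dcg = sum(1.0 / math.log2(rank + 1) for rank in hit_ranks)
    # Ideal DCG: all hidden songs (up to K) ranked first.
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(hidden), k) + 1))
    return {
        "hit": 1.0 if hit_ranks else 0.0,
        "recall": len(hit_ranks) / len(hidden) if hidden else 0.0,
        "ndcg": dcg / ideal if ideal > 0 else 0.0,
        "mrr": 1.0 / hit_ranks[0] if hit_ranks else 0.0,
    }


def mean_metrics(cases: List[Tuple[Sequence[str], Set[str]]], k: int) -> Dict[str, float]:
    """Average the per-playlist metrics over all test cases."""
    per_case = [metrics_at_k(ranked, hidden, k) for ranked, hidden in cases]
    if not per_case:
        return {}
    return {name: sum(m[name] for m in per_case) / len(per_case) for name in per_case[0]}


if __name__ == "__main__":
    print(metrics_at_k(["a", "b", "c", "d"], {"b", "x"}, k=3))
```

Averaging the per-playlist hit value gives HitRate@K, and averaging the reciprocal rank of the first hit gives MRR@K.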

Evaluated at K = 10, 20, 40 in the standalone scripts; K = 5, 10, 20, 50 in main.ipynb. Test set: 20% of playlists with ~20% of each playlist's songs hidden.
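
The test-set construction can be sketched in plain Python. The constants mirror the values quoted in this README, but everything else is an assumption made for illustration: the function name, the rule that a playlist is eligible when it keeps at least two songs, and the rounding used for the number of hidden songs. data_processing.py may make different choices.

```python
import random
from typing import Dict, List, Tuple

MIN_SONG_FREQ = 3
TEST_PLAYLIST_FRAC = 0.2
HIDDEN_SONG_FRAC = 0.2
RANDOM_SEED = 42

Playlists = Dict[str, List[str]]


def split_playlists(playlists: Playlists, seed: int = RANDOM_SEED) -> Tuple[Playlists, Playlists, Playlists]:
    """Return (train, test_seen, test_hidden), each keyed by playlist id."""
    rng = random.Random(seed)

    # Count the playlists each song appears in, then drop rare songs.
    freq: Dict[str, int] = {}
    for songs in playlists.values():
        for song in set(songs):
            freq[song] = freq.get(song, 0) + 1
    kept = {
        pid: sorted({s for s in songs if freq[s] >= MIN_SONG_FREQ})
        for pid, songs in playlists.items()
    }

    # Assumption: eligible means at least 2 songs survive the filter,
    # so one can be hidden while at least one stays visible.
    eligible = sorted(pid for pid, songs in kept.items() if len(songs) >= 2)
    n_test = int(len(eligible) * TEST_PLAYLIST_FRAC)
    test_ids = set(rng.sample(eligible, n_test))

    train: Playlists = {}
    test_seen: Playlists = {}
    test_hidden: Playlists = {}
    for pid, songs in kept.items():
        if not songs:
            continue
        if pid not in test_ids:
            train[pid] = songs
            continue
        # Hide roughly HIDDEN_SONG_FRAC of the songs, at least one.
        n_hidden = max(1, int(round(len(songs) * HIDDEN_SONG_FRAC)))
        n_hidden = min(n_hidden, len(songs) - 1)
        hidden = set(rng.sample(songs, n_hidden))
        test_hidden[pid] = [s for s in songs if s in hidden]
        test_seen[pid] = [s for s in songs if s not in hidden]
    return train, test_seen, test_hidden
```

Because the random generator is seeded and every collection is sorted before sampling, repeated calls on the same input return the same split.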


Results

| Model | HitRate@10 | Recall@10 | NDCG@10 | MRR@10 | Coverage |
| --- | --- | --- | --- | --- | --- |
| Popularity Baseline | 0.028 | 0.008 | 0.007 | 0.014 | 0.09% |
| Co-occurrence | 0.345 | 0.174 | 0.172 | 0.229 | 8.1% |
| BM25 Co-occurrence | 0.421 | 0.206 | 0.206 | 0.276 | 10.9% |
| ALS (factors=32) | 0.099 | 0.028 | 0.029 | 0.054 | 3.9% |
| KNN (K=500) | 0.314 | 0.172 | 0.178 | 0.225 | 6.3% |

BM25 co-occurrence is the best-performing model on every metric in the table. On HitRate@10 it reaches 0.421, about 15× the popularity baseline (0.028) and a +22% relative gain over raw co-occurrence (0.345).
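
The two headline figures follow directly from the HitRate@10 column. The snippet below is only a worked check of that arithmetic, using the values copied from the results table.

```python
# HitRate@10 values copied from the results table above.
hit_rate_at_10 = {
    "Popularity Baseline": 0.028,
    "Co-occurrence": 0.345,
    "BM25 Co-occurrence": 0.421,
}


def ratio(new: float, old: float) -> float:
    """How many times larger new is than old."""
    return new / old


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, as a fraction."""
    return (new - old) / old


bm25 = hit_rate_at_10["BM25 Co-occurrence"]
bm25_vs_pop = ratio(bm25, hit_rate_at_10["Popularity Baseline"])
bm25_vs_cooc = relative_gain(bm25, hit_rate_at_10["Co-occurrence"])
cooc_vs_pop = ratio(hit_rate_at_10["Co-occurrence"], hit_rate_at_10["Popularity Baseline"])

if __name__ == "__main__":
    print(f"BM25 vs popularity: {bm25_vs_pop:.1f}x")
    print(f"BM25 vs co-occurrence: {bm25_vs_cooc:+.0%}")
    print(f"Co-occurrence vs popularity: {cooc_vs_pop:.1f}x")
```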

make summarize regenerates this table at results/summary.md from the per-model JSON metrics.
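
The aggregation step can be pictured with a short sketch. It is not the code in summarize_results.py: the function name is hypothetical, and it assumes each `<tag>_metrics.json` file holds a flat JSON object mapping metric names to numbers, which this README does not confirm.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def summarize(results_dir: str, out_csv: str = "summary.csv") -> List[Dict[str, object]]:
    """Collect every *_metrics.json in results_dir into one CSV table."""
    results = Path(results_dir)
    rows: List[Dict[str, object]] = []
    for path in sorted(results.glob("*_metrics.json")):
        metrics = json.loads(path.read_text(encoding="utf-8"))
        # The model tag is the file name without the _metrics.json suffix.
        tag = path.name[: -len("_metrics.json")]
        rows.append({"model": tag, **metrics})
    if rows:
        metric_names = sorted({key for row in rows for key in row} - {"model"})
        with open(results / out_csv, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=["model"] + metric_names, restval="")
            writer.writeheader()
            writer.writerows(rows)
    return rows


if __name__ == "__main__":
    for row in summarize("results"):
        print(row)
```

Models that lack a given metric get an empty cell in the CSV rather than raising an error, which keeps the table usable when scripts report different sets of K values.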


Hypotheses & Verdicts

| Hypothesis | Verdict |
| --- | --- |
| H1: Popularity is a weak baseline | Confirmed: 2.8% vs 42.1% HitRate@10 |
| H2: Co-occurrence captures genre context | Confirmed: 12× gain over popularity |
| H3: BM25 length-normalisation helps | Confirmed: +22% relative gain over co-occurrence |
| H4: ALS underfits on sparse data | Confirmed: 9.9% HitRate@10, below co-occurrence |
| H5: KNN competitive on sparse data | Partially confirmed: beats ALS but trails co-occurrence on HitRate@10 (31.4% vs 34.5%) at this scale |

Repository structure

CS506-project/
├── Makefile                     # Build + pipeline targets
├── requirements.txt             # Pinned Python dependencies
├── download_data.py             # Kaggle download via kagglehub
├── data_cleaning.py             # Lenient CSV parse → cleaned dataset
├── data_processing.py           # Train/test split + masking
├── explore_data.py              # EDA + static PNGs
├── rec_pop.py                   # Popularity baseline
├── rec_cooc.py                  # Co-occurrence
├── rec_bm25.py                  # BM25 co-occurrence (best model)
├── rec_als.py                   # ALS matrix factorisation
├── rec_knn.py                   # KNN playlist similarity
├── rec_knn_advanced.py          # KNN variant
├── summarize_results.py         # Aggregate per-model metrics
├── make_interactive_viz.py      # Build interactive Plotly chart
├── recommendation.py            # Shared recommendation utilities
├── main.ipynb                   # Full pipeline as a notebook (Colab-ready)
├── preprocessing.ipynb          # Data cleaning + baseline models notebook
├── tests/                       # pytest suite (unit + smoke tests)
├── scripts/                     # SCC cluster submission scripts
├── figures/                     # Visualizations (interactive HTML + PNGs)
├── results/                     # Model outputs (created by `make models`)
├── data/                        # Raw and processed data (gitignored)
└── report.md                    # Full project report

Dependencies

See requirements.txt. The project's own code is pure Python with no compiled extensions to build, and it works on Python 3.11 through 3.14.
