boseongkang/mlproject

Project Title : Make your song Billboard Top 100


Authors : Boseong Kang, Geonho Lee

Description of Question and Research Topic

This project analyzes lyric embeddings and the instruments commonly used in Billboard Top 100 songs to reveal trends over time.
For example, if an artist is composing a song, our model can suggest which combination of instruments and lyrics is likely to reach the Billboard Top 100.
We analyze lyric trends by clustering lyrics and examining word similarity.
We then analyze instrument usage ratios by converting each song into a frequency spectrum.
Finally, we combine these two analyses to build a machine learning model that predicts the probability of chart entry.

Project Outline/Plan

Data Collection Plan (two parts, one for each author)

  • Part 1: Boseong Kang
    Use the billboard.py Python library to get the title, rank, and artist of each Billboard Top 100 song.
    Taking lyrics directly from websites raises copyright issues, so we use lyrics data from Kaggle instead.
    https://www.kaggle.com/datasets/bwandowando/spotify-songs-with-attributes-and-lyrics
    This dataset is licensed under CC BY-NC-SA 4.0, which means we are free to share and adapt it for non-commercial use, provided we give appropriate credit and distribute any adaptation under the same license.
    https://www.kaggle.com/datasets/suparnabiswas/billboard-hot-1002000-2023-data-with-features (new dataset)
    The new dataset is licensed under CC0: Public Domain, so it can be copied, modified, distributed, and performed, even for commercial purposes, without asking permission.

  • Part 2: Geonho Lee
    Use the songs collected in Part 1 to analyze their audio characteristics.
    Convert each song’s WAV data into a frequency spectrum using NumPy and SciPy, and extract features such as band energy ratios (bass, vocal, cymbal).
    Use these band energy ratios as approximate instrument energy ratios to identify which frequency ranges are dominant in each song.
    Visualize the results using Matplotlib to compare how different tracks emphasize different sound bands.
    Compare audio patterns with lyrical patterns to analyze overall music trends.
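
The band energy ratio step in Part 2 could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function name `band_energy_ratios` and the band edges in `BANDS` are assumptions, since the plan names the bands (bass, vocal, cymbal) but not their frequency limits. A synthetic signal stands in for real WAV data.

```python
import numpy as np

# Hypothetical band edges in Hz (assumed for illustration; not taken from the plan).
BANDS = {
    "bass": (20.0, 250.0),
    "vocal": (250.0, 4000.0),
    "cymbal": (4000.0, 16000.0),
}


def band_energy_ratios(samples, sample_rate, bands=BANDS):
    """Return the share of spectral energy that falls inside each band."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 2:
        # Average the channels of a stereo signal to get mono.
        samples = samples.mean(axis=1)

    # Power spectrum of the real-valued signal.
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)

    energies = {
        name: float(power[(freqs >= low) & (freqs < high)].sum())
        for name, (low, high) in bands.items()
    }
    total = sum(energies.values())
    if total == 0.0:
        # Silent input: no energy in any band.
        return {name: 0.0 for name in bands}
    return {name: energy / total for name, energy in energies.items()}


# Synthetic one-second test signal: a 100 Hz tone plus a quieter 1000 Hz tone.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

ratios = band_energy_ratios(tone, sample_rate)
print(ratios)
```

For a real track, the samples and sample rate could be loaded with `scipy.io.wavfile.read` and passed to the same function. The resulting ratios are the kind of per-song feature vector the plan describes visualizing with Matplotlib.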

Model Plans (two parts, one for each author)

  • Part 1: Boseong Kang
    Logistic Regression: Similar to classification on the MNIST dataset, preprocess the data first, then use one-hot encoding or TF-IDF from scikit-learn to classify top 10 songs versus all other songs.
    MLP: Using the preprocessed words, classify top 10 songs versus other songs with an MLP that uses ReLU and sigmoid activation functions.

  • Part 2: Geonho Lee
    Visualization: Use Matplotlib to visualize each song’s frequency spectrum and energy distribution.
    Model: Apply Logistic Regression from scikit-learn to analyze relationships between the extracted audio features (band energy ratios) and the data from Part 1. Because Logistic Regression cannot properly capture non-linear relationships, we will also use an MLP implemented in PyTorch to explore non-linear sound patterns across frequency bands.
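
The lyric-based Logistic Regression baseline from Part 1 might look something like the sketch below. It assumes TF-IDF features (one of the two encodings the plan mentions) and a binary label of top 10 versus other songs. The lyric strings and labels are made-up placeholders for illustration only, not project data, and the hyperparameters are untuned guesses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up placeholder lyrics (NOT project data), used only to show the workflow.
lyrics = [
    "dance all night under the city lights",
    "baby hold me close and dance tonight",
    "party lights dance with me tonight",
    "rain falls slowly on the empty road",
    "lonely road and quiet winter rain",
    "cold wind blows across the empty field",
]
# Hypothetical labels: 1 = reached the top 10, 0 = any other chart position.
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each lyric into a sparse weighted word vector;
# Logistic Regression then fits a linear decision boundary on those vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(lyrics, labels)

# Estimated probability that a new lyric belongs to the top 10 class.
probs = model.predict_proba(["dance with me all night"])[:, 1]
print(probs)
```

With real data, the labels would come from the chart ranks collected in the data collection step, and a held-out test split would be needed before reading anything into the predicted probabilities.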

Project Timeline

[Image: project timeline]

Our Project Roadmap link

It may take time to load our Roadmap.
Open the Roadmap on GitHub
