Region-wise key estimation for songs that change key
Most key-estimation demos force one global label onto a whole track. This repo keeps the section boundary visible: it estimates likely key regions, reports candidate modulation points, and can pitch-shift each region toward a target key.
```json
{
  "target_key_name": "C",
  "modulation_points": [{"time_sec": 74.24}],
  "region_infos": [
    {"start_time_sec": 0.0, "end_time_sec": 74.24, "key_name": "G", "confidence": 0.82},
    {"start_time_sec": 74.24, "end_time_sec": 181.76, "key_name": "A", "confidence": 0.77}
  ]
}
```

- extracts chroma and HPCP-style harmonic pitch-class features
- runs a two-stream Transformer checkpoint
- predicts probabilities over the 12 pitch classes for each audio window
- groups windows into likely key regions
- exposes approximate modulation points
- serves local CLI and FastAPI inference
- downloads the release checkpoint with SHA-256 verification
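
As a rough illustration of how the output above could be consumed, the sketch below computes a per-region semitone shift toward the target key. The field names come from the example output; the `semitone_shift` helper and its "shortest signed distance" rule are assumptions made for this example, not necessarily how the package chooses its shift.

```python
import json

# Sharp-only spelling, matching the pitch-class labels the checkpoint predicts.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitone_shift(source_key: str, target_key: str) -> int:
    """Smallest signed shift in semitones from source_key to target_key.

    Hypothetical helper: results fall in the range -5..6.
    """
    src = PITCH_CLASSES.index(source_key.strip().upper())
    dst = PITCH_CLASSES.index(target_key.strip().upper())
    diff = (dst - src) % 12
    return diff - 12 if diff > 6 else diff


# The example output from this README, parsed as the CLI's --json result.
result = json.loads("""
{
  "target_key_name": "C",
  "modulation_points": [{"time_sec": 74.24}],
  "region_infos": [
    {"start_time_sec": 0.0, "end_time_sec": 74.24, "key_name": "G", "confidence": 0.82},
    {"start_time_sec": 74.24, "end_time_sec": 181.76, "key_name": "A", "confidence": 0.77}
  ]
}
""")

# One (start, end, detected key, shift) tuple per region.
shifts = [
    (
        region["start_time_sec"],
        region["end_time_sec"],
        region["key_name"],
        semitone_shift(region["key_name"], result["target_key_name"]),
    )
    for region in result["region_infos"]
]

for start, end, key, shift in shifts:
    print(f"{start:8.2f}s to {end:8.2f}s  {key:>2} -> shift {shift:+d} semitones")
```

Shifting up by 5 semitones and down by 7 both land on the same pitch class; which direction sounds better depends on the material, so treat the sign convention here as one option among several.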
```bash
git clone https://github.com/SihyeonJeon/Modulation-aware-key-estimator.git
cd Modulation-aware-key-estimator
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

The checkpoint downloads on first use from the GitHub release and is cached
under ~/.cache/modulation-aware-key-estimator/.
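
To check a cached or manually downloaded checkpoint by hand, a SHA-256 comparison can be sketched as below. This is a generic illustration using Python's standard `hashlib`; the function names are hypothetical, and the expected digest has to come from the release page, since this README does not list it.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """True if the file's digest matches the expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A mismatch usually means a truncated download or a different release asset, so deleting the cached file and fetching it again is a reasonable first step.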
Use a local checkpoint instead:

```bash
MODEL_CHECKPOINT_PATH=/path/to/key_model.pt mod-key-estimator --wav song.wav --json
```

Local file:

```bash
mod-key-estimator --wav song.wav --target-key c --json
```

YouTube URL through yt-dlp:

```bash
mod-key-estimator --youtube-url "https://www.youtube.com/watch?v=..." --target-key f#
```

If a video requires browser cookies, pass them explicitly:

```bash
mod-key-estimator --youtube-url "https://www.youtube.com/watch?v=..." --cookies ./cookies.txt
```

No cookies file is stored in this repository.

```bash
uvicorn modulation_key_estimator.api:app --host 0.0.0.0 --port 8000
```

```bash
curl -X POST http://localhost:8000/analyze-file \
  -F "file=@song.wav" \
  -F "target_key=c"
```

```bash
curl -X POST http://localhost:8000/analyze-youtube \
  -H "content-type: application/json" \
  -d '{"youtube_url":"https://www.youtube.com/watch?v=...","target_key":"c"}'
```

```bash
docker build -t modulation-key-estimator .
docker run --rm -p 8000:8000 modulation-key-estimator
```

| Item | Value |
|---|---|
| input | mono audio, resampled to 16 kHz |
| features | chroma + HPCP-style 12-bin harmonic features |
| architecture | two-stream Transformer encoder with attention pooling |
| output | 12 pitch-class probabilities per window |
| regioning | probability-shift grouping across neighboring windows |
| checkpoint | GitHub Release asset with SHA-256 verification |
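
The regioning row can be pictured with a deliberately simplified sketch: take the most probable pitch class per window, merge consecutive windows with the same label, and absorb runs that are too short to count as a real key change. This is a stand-in for illustration only; the repository's actual probability-shift grouping may differ, and `hop_sec` and `min_windows` are hypothetical parameters.

```python
def group_regions(window_probs, hop_sec, min_windows=3):
    """Group per-window 12-bin probabilities into (start_sec, end_sec, pitch_class) regions.

    Simplified sketch: argmax per window, run-length grouping, and short runs
    (fewer than min_windows windows) folded into the preceding region.
    """
    labels = [max(range(12), key=lambda k: probs[k]) for probs in window_probs]

    # Run-length encode the labels as [label, start_index, end_index_exclusive].
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1
        else:
            runs.append([label, i, i + 1])

    # Fold short runs, and runs that repeat the current label, into the previous region.
    merged = []
    for label, start, end in runs:
        if merged and (end - start < min_windows or merged[-1][0] == label):
            merged[-1][2] = end
        else:
            merged.append([label, start, end])

    return [(start * hop_sec, end * hop_sec, label) for label, start, end in merged]


def modulation_points(regions):
    """Candidate modulation times: the start of every region after the first."""
    return [start for start, _end, _label in regions[1:]]
```

With this scheme a single-window blip toward another key does not create a new region, which is the behavior a listener would usually expect; the cost is that very brief genuine modulations are smoothed away.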
See docs/model-card.md for intended use, limitations, and failure modes.
Run a labeled manifest:

```bash
python scripts/evaluate_manifest.py examples/manifest.example.csv --json
```

Expected CSV columns:

```csv
path,expected_key
path/to/song.wav,c
```

The script reports exact pitch-class accuracy and per-file predictions. Replace the example manifest with local labeled audio before reporting a benchmark number.
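
For reference, exact pitch-class accuracy over such a manifest could be computed roughly as follows. This is a sketch under stated assumptions, not the contents of scripts/evaluate_manifest.py: the `predictions` mapping (path to predicted key name) and the helper names are hypothetical, and only sharp spellings are handled.

```python
import csv

# Sharp-only spelling; flat names such as "Bb" are not handled in this sketch.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(name):
    """Map a key name such as 'c' or 'F#' to its pitch-class index (0..11)."""
    return PITCH_CLASSES.index(name.strip().upper())


def exact_accuracy(manifest_file, predictions):
    """Fraction of manifest rows whose predicted pitch class equals expected_key.

    manifest_file: open text file with the columns path,expected_key.
    predictions: dict mapping each manifest path to a predicted key name.
    """
    rows = list(csv.DictReader(manifest_file))
    if not rows:
        return 0.0
    hits = sum(
        pitch_class(row["expected_key"]) == pitch_class(predictions[row["path"]])
        for row in rows
    )
    return hits / len(rows)
```

Because the comparison is exact, a prediction a fifth away from the label counts as a plain miss, the same as any other wrong pitch class.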
This repo currently ships the inference package, model architecture, release checkpoint, and manifest-based evaluation script. It does not yet ship the original training code, training manifest, dataset list, or training logs.
The checkpoint predicts pitch class only: C, C#, ..., B. It does not
model major/minor, modal function, enharmonic spelling, or score-level harmonic
analysis.