This repository provides code for reproducing the experiments in the following paper:
Kando, S., Miyao, Y., Takamichi, S. (2025) Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models. Proc. Interspeech 2025, 5728-5732, doi: 10.21437/Interspeech.2025-310
We use two environment management tools: pyenv and poetry.
To reproduce the experimental environment, install Python 3.10.13 with pyenv and then run poetry install.
You may use other tools such as venv, conda, or rye.
However, there is no guarantee that the implementation will work as expected.
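If you set up the environment with a different tool, a quick check of the interpreter version may help catch mismatches early. The following is a minimal sketch, not part of this repository; the helper name `version_matches` is hypothetical, and the only fact taken from the instructions above is the version number 3.10.13.

```python
import sys

# Version named in the setup instructions above.
EXPECTED_VERSION = (3, 10, 13)


def version_matches(version_info, expected=EXPECTED_VERSION):
    """Return True if the major, minor, and micro parts equal the expected tuple."""
    return tuple(version_info[:3]) == expected


if __name__ == "__main__":
    if version_matches(sys.version_info):
        print("Python version matches 3.10.13.")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Expected Python 3.10.13 but found {found}.")
```

Running this inside the activated environment reports whether the active interpreter is the one the experiments were run with.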
Download the train and dev sets of LibriSpeech from the website. Extracting the archives should produce the following directory structure.
LibriSpeech
├── BOOKS.TXT
├── CHAPTERS.TXT
├── LICENSE.TXT
├── README.TXT
├── SPEAKERS.TXT
├── dev
│ ├── dev-clean
│ └── dev-other
└── train
  ├── train-clean-100
  ├── train-clean-360
  └── train-other-500
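To confirm that the extracted archives match the layout above, something like the following could be used. This is a minimal sketch and not part of the repository: the function name `find_missing_subdirs` and the default root path `LibriSpeech` are assumptions made for illustration, and only the subdirectory names come from the tree above.

```python
from pathlib import Path

# Subdirectories listed in the tree above, relative to the LibriSpeech root.
EXPECTED_SUBDIRS = [
    "dev/dev-clean",
    "dev/dev-other",
    "train/train-clean-100",
    "train/train-clean-360",
    "train/train-other-500",
]


def find_missing_subdirs(root):
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


if __name__ == "__main__":
    import sys

    # Hypothetical default location; pass the real path as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "LibriSpeech"
    missing = find_missing_subdirs(root)
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

The check only looks at directory names, so it does not verify that the audio files inside each subset are complete.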
Download the following benchmarks:
- sBLIMP and sWUGGY: included in sLM21-dataset
- prosaudit: included in prosaudit-dataset
- tSC (Topic StoryCloze)
We provide example scripts under the example directory.
Please refer to the comments in the script files as well.
Run example/kmeans_train.sh.
Run example/kmeans_infer.sh.
Note that this step must be run for every prepared dataset, i.e., LibriSpeech (train and dev) and the benchmarks (sBLIMP, sWUGGY, etc.).
@inproceedings{kando25_interspeech,
title = {{Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models}},
author = {Shunsuke Kando and Yusuke Miyao and Shinnosuke Takamichi},
year = {2025},
booktitle = {{Interspeech 2025}},
pages = {5728--5732},
doi = {10.21437/Interspeech.2025-310},
issn = {2958-1796},
}