Bandwidth Extension fine-tuning experiment for S.O.S — Sound of the Sky.
Goal: fine-tune ClearerVoice-Studio MossFormer2_SR_48K on Twilio G.711 µ-law degraded speech pairs to reconstruct frequencies above 4kHz on telephone voice messages.
S.O.S collects voice messages (wishes) via telephone (Twilio). The audio arrives as MP3 22kHz mono, but the effective bandwidth is narrowband: 300Hz–3.8kHz (G.711 µ-law 8kHz codec upstream, re-encoded to MP3 by Twilio). All pre-trained speech super-resolution models fail on this specific degradation because they were trained on simple LPF downsampling, not on µ-law codec artifacts. This repo fine-tunes MossFormer2_SR_48K on matched Twilio-degraded pairs.
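To make the difference from plain low-pass downsampling concrete, here is a minimal sketch of µ-law companding. It assumes the continuous µ = 255 formula, not the bit-exact segmented G.711 encoder, and the function names are illustrative (they are not part of this repo). The point is that quantization happens in a logarithmic domain, so the error grows with sample amplitude instead of being uniform.

```python
import math

MU = 255  # µ-law parameter used by G.711 (continuous approximation below)


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def mulaw_roundtrip(x: float, bits: int = 8) -> float:
    """Compress, quantize uniformly to 2**bits levels, then expand."""
    levels = 2 ** bits
    y = mulaw_compress(x)
    # Uniform quantization in the companded (logarithmic) domain.
    q = round((y + 1) / 2 * (levels - 1))
    y_hat = q / (levels - 1) * 2 - 1
    return mulaw_expand(y_hat)


if __name__ == "__main__":
    for sample in (0.01, 0.1, 0.9):
        error = abs(mulaw_roundtrip(sample) - sample)
        print(f"x={sample}: absolute round-trip error {error:.6f}")
```

A model trained only on low-pass-filtered input never sees this amplitude-dependent quantization noise, which is one plausible reason the matched pairs matter.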
sos-bwe-experiment/
├── scripts/
│   ├── extract_common_voice.py   # Extract validated clips from Common Voice tar.gz
│   ├── simulate_twilio.py        # Generate Twilio-degraded pairs from clean WAV
│   └── prepare_cv_dataset.py     # (legacy) HuggingFace streaming approach
├── patches/
│   ├── dataloader.patch          # Patch for ClearerVoice dataloader
│   └── twilio_finetune.yaml      # Training config for fine-tuning
├── setup_runpod.sh               # Automated setup script for RunPod
├── data/                         # (gitignored) Audio data
│   └── pairs/
│       ├── clean/                # Clean WAV 48kHz 16bit mono
│       └── degraded/             # Twilio-simulated degraded WAV 48kHz 16bit mono
└── .gitignore
pyenv install 3.11.9
pyenv local 3.11.9
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Download the French dataset from https://commonvoice.mozilla.org (~28GB tar.gz).
# Edit the archive path in the script if needed, then:
python scripts/extract_common_voice.py
This extracts 3000 validated MP3 clips, converts them to WAV 48kHz mono 24bit, and filters by duration (2-15s). Output: data/pairs/clean/cv_fr_00000.wav ... cv_fr_02999.wav.
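The selection rule described above can be sketched as follows. This is a hypothetical helper, not the code in scripts/extract_common_voice.py: it assumes clip durations are already known, treats the 2-15s bounds as inclusive, and numbers the kept clips in the order they are seen.

```python
def select_clips(durations, min_s=2.0, max_s=15.0, limit=3000):
    """Pick clips within the duration window and assign output names.

    durations: iterable of (source_name, seconds) tuples.
    Returns a list of (source_name, output_name) tuples.
    """
    selected = []
    for source, seconds in durations:
        if min_s <= seconds <= max_s:
            # Zero-padded index matches the cv_fr_00000.wav naming scheme.
            selected.append((source, f"cv_fr_{len(selected):05d}.wav"))
            if len(selected) == limit:
                break
    return selected


if __name__ == "__main__":
    demo = [("a.mp3", 1.2), ("b.mp3", 3.4), ("c.mp3", 21.0), ("d.mp3", 7.9)]
    for source, output in select_clips(demo):
        print(source, "->", output)
```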
python scripts/simulate_twilio.py data/pairs/clean/ data/pairs/degraded/
This simulates the Twilio degradation chain on each clean file:
WAV 48kHz → G.711 µ-law 8kHz → MP3 22kHz 32kbps → WAV 48kHz
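One way to express this chain is as three ffmpeg invocations. The sketch below only builds the command lists and is an assumption about how the chain could be run, not a copy of scripts/simulate_twilio.py: the helper name, the intermediate file names, and the choice of 22050 Hz for "22kHz" are hypothetical, and the 300 Hz low-end roll-off is not modelled.

```python
from pathlib import Path


def twilio_chain_commands(clean_wav, out_wav, work_dir):
    """Return the three ffmpeg commands for one clean -> degraded pair."""
    work = Path(work_dir)
    stem = Path(clean_wav).stem
    ulaw = work / f"{stem}_ulaw.wav"  # intermediate G.711 µ-law file
    mp3 = work / f"{stem}.mp3"        # intermediate MP3 file
    return [
        # 1. Narrowband telephone codec: 8 kHz mono, µ-law.
        ["ffmpeg", "-y", "-i", str(clean_wav),
         "-ar", "8000", "-ac", "1", "-c:a", "pcm_mulaw", str(ulaw)],
        # 2. Re-encode to MP3 at 32 kbps (22050 Hz assumed for "22kHz").
        ["ffmpeg", "-y", "-i", str(ulaw),
         "-ar", "22050", "-ac", "1", "-c:a", "libmp3lame", "-b:a", "32k", str(mp3)],
        # 3. Decode back to 48 kHz 16-bit PCM so it matches the clean file's rate.
        ["ffmpeg", "-y", "-i", str(mp3),
         "-ar", "48000", "-ac", "1", "-c:a", "pcm_s16le", str(out_wav)],
    ]


if __name__ == "__main__":
    for cmd in twilio_chain_commands(
        "data/pairs/clean/cv_fr_00000.wav",
        "data/pairs/degraded/cv_fr_00000.wav",
        "/tmp/twilio_work",
    ):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the working directory exists.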
Simulation fidelity was validated against 5 real Twilio recordings: waveform correlation 0.997-0.999, spectral differences <0.7dB.
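The README does not say how the waveform correlation was computed. A minimal sketch, assuming a plain Pearson correlation over two already time-aligned sample sequences (the function name is illustrative):

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation of two sample sequences, truncated to equal length."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("empty input")
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    real = [0.0, 0.5, -0.3, 0.8, -0.9]
    simulated = [0.01, 0.48, -0.31, 0.79, -0.88]
    print(pearson_correlation(real, simulated))
```

Because Pearson correlation ignores gain and DC offset, a value near 1 says the shapes match but says nothing about level differences. The spectral comparison covers that part.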
The WAV files from ffmpeg may use WAVE_FORMAT_EXTENSIBLE (format 65534) which ClearerVoice cannot read. Convert to standard PCM:
# Run this on the machine where training will happen
python -c "
import subprocess, os
from pathlib import Path
for d in ['clean', 'degraded']:
    dir_path = Path('data/pairs') / d
    files = sorted(dir_path.glob('*.wav'))
    fixed = 0
    for f in files:
        # Write to a temporary file, then replace the original on success
        tmp = f.with_suffix('.tmp.wav')
        result = subprocess.run(
            ['ffmpeg', '-i', str(f), '-acodec', 'pcm_s16le', '-ar', '48000', '-ac', '1', '-y', str(tmp)],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            tmp.rename(f)
            fixed += 1
            if fixed % 500 == 0 and fixed > 0:
                print(f'  {d}: fixed {fixed}/{len(files)}')
    print(f'{d}: fixed {fixed} files total')
"
- GPU: A100 SXM 80GB (or A40 48GB for cheaper runs)
- Template: RunPod PyTorch 2.4.0 (py3.11, CUDA 12.4)
- Storage: Container Disk 50GB, Volume Disk 50GB
- Enable SSH terminal access
ssh root@<POD_IP> -p <PORT> -i ~/.ssh/id_ed25519
cd /workspace
git clone https://github.com/aker-dev/sos-bwe-experiment.git
git clone https://github.com/modelscope/ClearerVoice-Studio.git
Note: if the repo is private, either make it temporarily public or use a GitHub Personal Access Token:
git clone https://<TOKEN>@github.com/aker-dev/sos-bwe-experiment.git
pip install -r ClearerVoice-Studio/requirements.txt
pip install huggingface_hub soundfile librosa
pip install "numpy<2" --force-reinstall
pip install pesq --force-reinstall --no-binary pesq
apt-get update && apt-get install -y ffmpeg rsync
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='alibabasglab/MossFormer2_SR_48K',
    local_dir='ClearerVoice-Studio/train/speech_super_resolution/checkpoints/MossFormer2_SR_48K'
)
print('Done')
"
From your local machine (not the pod):
cd ~/Documents/GitHub/sos-bwe-experiment
# Create dirs on pod first (via another SSH session):
# mkdir -p /workspace/sos-bwe-experiment/data/pairs/clean
# mkdir -p /workspace/sos-bwe-experiment/data/pairs/degraded
rsync -avz --no-owner --no-group --progress \
data/pairs/ \
root@<POD_IP>:/workspace/sos-bwe-experiment/data/pairs/ \
-e "ssh -p <PORT>"
python -c "
import subprocess, os
from pathlib import Path
for d in ['clean', 'degraded']:
    dir_path = Path(f'/workspace/sos-bwe-experiment/data/pairs/{d}')
    files = sorted(dir_path.glob('*.wav'))
    fixed = 0
    for f in files:
        # Write to a temporary file, then replace the original on success
        tmp = f.with_suffix('.tmp.wav')
        result = subprocess.run(
            ['ffmpeg', '-i', str(f), '-acodec', 'pcm_s16le', '-ar', '48000', '-ac', '1', '-y', str(tmp)],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            tmp.rename(f)
            fixed += 1
            if fixed % 500 == 0 and fixed > 0:
                print(f'  {d}: fixed {fixed}/{len(files)}')
    print(f'{d}: fixed {fixed} files total')
"
cd /workspace/ClearerVoice-Studio/train/speech_super_resolution
# Apply the patch manually (the .patch file may not apply cleanly due to whitespace)
python -c "
content = open('dataloader/dataloader.py').read()
# Find the exact old lines (check with: sed -n '550,556p' dataloader/dataloader.py)
# The old code does random downsampling:
# sr_out = random.choice(args.supported_sampling_rates)
# audio_down = resample(audio, args.sampling_rate, sr_out)
# target_len = len(audio)
# audio_input = resample(audio_down, None, None, target_len)
# We need to match the exact whitespace. Print lines around sr_out to verify:
lines = content.split('\n')
for i, l in enumerate(lines):
    if 'sr_out = random' in l:
        for j in range(i, min(i+5, len(lines))):
            print(f'Line {j+1}: {repr(lines[j])}')
        break
"
# Then use the exact matched string to patch (adapt based on output above)
# See patches/dataloader.patch for the intended change
The patch replaces the random downsampling with pre-degraded Twilio pair loading:
# BEFORE (original):
sr_out = random.choice(args.supported_sampling_rates)
audio_down = resample(audio, args.sampling_rate, sr_out)
target_len = len(audio)
audio_input = resample(audio_down, None, None, target_len)
# AFTER (patched):
degraded_dir = os.path.dirname(filename).replace('/clean', '/degraded')
basename = os.path.basename(filename)
degraded_path = os.path.join(degraded_dir, basename)
target_len = len(audio)
audio_input = None
if os.path.exists(degraded_path):
    audio_input, _ = load_segment(degraded_path, args.sampling_rate, args.segment_size)
    if audio_input is not None and len(audio_input) != target_len:
        audio_input = resample(audio_input, None, None, target_len)
if audio_input is None:
    # Fall back to the original random downsampling when no degraded file exists
    sr_out = random.choice(args.supported_sampling_rates)
    audio_down = resample(audio, args.sampling_rate, sr_out)
    audio_input = resample(audio_down, None, None, target_len)
# Training config
cat > config/train/twilio_finetune.yaml << 'EOF'
mode: 'train'
network: "MossFormer2_SR_48K"
config_json: "config/train/MossFormer2_SR_48K.json"
checkpoint_dir: "checkpoints/MossFormer2_SR_48K"
tr_list: 'data/train.scp'
cv_list: 'data/cv.scp'
tt_list: 'None'
batch_size: 8
max_epoch: 50
weight_decay: 0.00001
clip_grad_norm: 10.
seed: 777
accu_grad: 1
effec_batch_size: 16
max_length: 4
EOF
# SCP files (90/10 train/val split)
mkdir -p data
python -c "
from pathlib import Path
clean_dir = Path('/workspace/sos-bwe-experiment/data/pairs/clean')
files = sorted(clean_dir.glob('*.wav'))
split = int(len(files) * 0.9)
train = [str(f) for f in files[:split]]
val = [str(f) for f in files[split:]]
Path('data/train.scp').write_text('\n'.join(train) + '\n')
Path('data/cv.scp').write_text('\n'.join(val) + '\n')
print(f'Train: {len(train)}, Val: {len(val)}')
"
CUDA_VISIBLE_DEVICES=0 python -W ignore \
train.py \
--config config/train/twilio_finetune.yaml \
--checkpoint_dir checkpoints/MossFormer2_SR_48K \
--train_from_last_checkpoint 1 \
--print_freq 10 \
--checkpoint_save_freq 500
Expected output:
- A100 80GB: ~0.007s/batch, ~2.4s/epoch, ~2 min for 50 epochs
- A40 48GB: slightly slower, still under 10 min total
- Loss should start at Gen_Loss ~100, Disc_Loss ~8 and decrease
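The A100 timing figures follow from the dataset size and batch settings. A small worked check, assuming the 90/10 split and batch_size 8 used in this guide and that the last partial batch is kept (the function name is illustrative):

```python
import math


def estimate_training_time(clips=3000, train_fraction=0.9, batch_size=8,
                           seconds_per_batch=0.007, epochs=50):
    """Return (batches_per_epoch, seconds_per_epoch, total_minutes)."""
    train_clips = int(clips * train_fraction)  # 2700 training clips
    batches_per_epoch = math.ceil(train_clips / batch_size)
    seconds_per_epoch = batches_per_epoch * seconds_per_batch
    total_minutes = seconds_per_epoch * epochs / 60
    return batches_per_epoch, seconds_per_epoch, total_minutes


if __name__ == "__main__":
    batches, per_epoch, minutes = estimate_training_time()
    print(f"{batches} batches/epoch, {per_epoch:.2f}s/epoch, {minutes:.1f} min total")
```

This counts pure batch time only. Checkpoint saving, validation passes, and model loading add to the wall-clock time.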
# Copy real Twilio test files to the pod, then:
cat > config/inference/twilio_finetune.yaml << 'EOF'
mode: 'inference'
config_json: "config/inference/MossFormer2_SR_48K.json"
use_cuda: 1
num_gpu: 1
sampling_rate: 48000
network: "MossFormer2_SR_48K"
checkpoint_dir: "checkpoints/MossFormer2_SR_48K"
input_path: "data/test_twilio.scp"
output_dir: "outputs/twilio_finetuned"
one_time_decode_length: 20
decode_window: 4
EOF
# Create test SCP with paths to real Twilio files
find /workspace/sos-bwe-experiment/data/real_twilio/ -name "*.wav" > data/test_twilio.scp
python inference.py --config config/inference/twilio_finetune.yaml
From your local machine:
rsync -avz --no-owner --no-group \
root@<POD_IP>:/workspace/ClearerVoice-Studio/train/speech_super_resolution/outputs/ \
~/Documents/GitHub/sos-bwe-experiment/outputs/ \
-e "ssh -p <PORT>"
# Also download the fine-tuned checkpoint
rsync -avz --no-owner --no-group \
root@<POD_IP>:/workspace/ClearerVoice-Studio/train/speech_super_resolution/checkpoints/MossFormer2_SR_48K/ \
~/Documents/GitHub/sos-bwe-experiment/checkpoints/MossFormer2_SR_48K/ \
-e "ssh -p <PORT>"
Don't forget to stop/terminate the pod when done to avoid charges.
WAV files use WAVE_FORMAT_EXTENSIBLE header. Fix with step 7 above (ffmpeg reconversion to pcm_s16le).
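To confirm that a file is affected before reconverting it, the format tag can be read straight from the RIFF header. A minimal sketch using only the standard library (the function name is illustrative). It walks the chunks until it finds "fmt " and returns the 16-bit wFormatTag, which is 1 for plain PCM and 65534 (0xFFFE) for WAVE_FORMAT_EXTENSIBLE:

```python
import struct

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # 65534


def wav_format_tag(data: bytes) -> int:
    """Return the wFormatTag field from the bytes of a RIFF/WAVE file."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        if chunk_id == b"fmt ":
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return tag
        # Chunks are padded to an even number of bytes.
        pos += 8 + size + (size & 1)
    raise ValueError("no fmt chunk found")


if __name__ == "__main__":
    import sys
    from pathlib import Path

    for name in sys.argv[1:]:
        tag = wav_format_tag(Path(name).read_bytes())
        label = "EXTENSIBLE" if tag == WAVE_FORMAT_EXTENSIBLE else str(tag)
        print(f"{name}: format tag {label}")
```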
pip install "numpy<2" --force-reinstall
pip install pesq --force-reinstall --no-binary pesq
The .patch file may not match due to trailing whitespace or blank lines. Use the Python string replacement approach: print the exact lines around sr_out = random and match them precisely.
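A minimal sketch of that string replacement approach, assuming the old block has been copied verbatim from the printed repr output. The helper name is hypothetical. It refuses to patch unless the old block occurs exactly once, so a whitespace mismatch fails loudly instead of silently leaving the file unchanged.

```python
def replace_exact_block(source: str, old: str, new: str) -> str:
    """Replace `old` with `new`, requiring exactly one occurrence."""
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(old, new)


if __name__ == "__main__":
    # Toy stand-in for dataloader.py; the real indentation must be copied
    # from the printed repr() lines.
    original = "    a = 1\n    sr_out = random.choice(rates)\n    b = 2\n"
    old_block = "    sr_out = random.choice(rates)\n"
    new_block = "    sr_out = None  # patched\n"
    print(replace_exact_block(original, old_block, new_block))
```

With the real file, read dataloader/dataloader.py into a string, call the helper, and write the result back only if no exception was raised.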
Patch train.py to force CPU with the commands below, and apply the equivalent edits to inference.py:
sed -i "s/device = torch.device('cuda') if args.use_cuda else torch.device('cpu')/device = torch.device('cpu')/" train.py
sed -i 's/torch.cuda.empty_cache()/#torch.cuda.empty_cache()/' train.py
sed -i 's/torch.cuda.set_device(args.local_rank)/#torch.cuda.set_device(args.local_rank)/' train.py
| GPU | Price/hr | Estimated training time (50 epochs, 3000 pairs) | Total cost |
|---|---|---|---|
| A100 SXM 80GB | $1.49/hr (RunPod) | ~15 min | ~$0.40 |
| A40 48GB | $0.40/hr (RunPod) | ~30 min | ~$0.20 |
| MacBook Air M2 (CPU) | Free | ~12 hours | $0 |
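The total cost column is the hourly price multiplied by the estimated run time, rounded. A small check of that arithmetic, assuming per-minute billing with no minimum charge (the function name is illustrative):

```python
def run_cost(price_per_hour: float, minutes: float) -> float:
    """Cost in dollars for a run of the given length."""
    return price_per_hour * minutes / 60


if __name__ == "__main__":
    for gpu, price, minutes in [("A100 SXM 80GB", 1.49, 15), ("A40 48GB", 0.40, 30)]:
        print(f"{gpu}: ${run_cost(price, minutes):.2f}")
```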
- ClearerVoice-Studio — Alibaba/ModelScope
- MossFormer2_SR_48K — Pretrained checkpoint
- Common Voice — Mozilla Foundation