aker-dev/sos-bwe-experiment

sos-bwe-experiment

Bandwidth Extension fine-tuning experiment for S.O.S — Sound of the Sky.

Goal: fine-tune ClearerVoice-Studio MossFormer2_SR_48K on Twilio G.711 µ-law degraded speech pairs to reconstruct frequencies above 4kHz on telephone voice messages.

Context

S.O.S collects voice messages (wishes) by telephone through Twilio. The audio arrives as 22kHz mono MP3, but its effective bandwidth is narrowband, 300Hz–3.8kHz, because the call passes through an 8kHz G.711 µ-law codec upstream before Twilio re-encodes it to MP3. All pre-trained speech super-resolution models fail on this specific degradation because they were trained on simple low-pass-filter (LPF) downsampling, not on µ-law codec artifacts. This repo fine-tunes MossFormer2_SR_48K on matched Twilio-degraded pairs.

Repository structure

sos-bwe-experiment/
├── scripts/
│   ├── extract_common_voice.py   # Extract validated clips from Common Voice tar.gz
│   ├── simulate_twilio.py        # Generate Twilio-degraded pairs from clean WAV
│   └── prepare_cv_dataset.py     # (legacy) HuggingFace streaming approach
├── patches/
│   ├── dataloader.patch          # Patch for ClearerVoice dataloader
│   └── twilio_finetune.yaml      # Training config for fine-tuning
├── setup_runpod.sh               # Automated setup script for RunPod
├── data/                         # (gitignored) Audio data
│   └── pairs/
│       ├── clean/                # Clean WAV 48kHz 16bit mono
│       └── degraded/             # Twilio-simulated degraded WAV 48kHz 16bit mono
└── .gitignore

Local setup

pyenv install 3.11.9
pyenv local 3.11.9
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Dataset preparation (local)

1. Download Common Voice FR

Download the French dataset from https://commonvoice.mozilla.org (~28GB tar.gz).

2. Extract and convert clips

# Edit the archive path in the script if needed, then:
python scripts/extract_common_voice.py

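This extracts 3000 validated MP3 clips, converts them to WAV (48kHz, mono, 24-bit), and filters by duration (2–15 s). Output: data/pairs/clean/cv_fr_00000.wav ... cv_fr_02999.wav. Step 4 below re-encodes these files to 16-bit PCM, the format listed in the repository structure.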
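The duration filter can be sketched with the standard-library `wave` module. This is not the code in scripts/extract_common_voice.py: the function names and default bounds are illustrative, and it assumes the clips are already plain PCM WAV files that `wave` can open.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the clip length in seconds, from the frame count and sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def keep_clip(path: Path, min_s: float = 2.0, max_s: float = 15.0) -> bool:
    """True when the clip duration lies inside the inclusive [min_s, max_s] range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A caller would apply `keep_clip` to each converted file and discard the ones that return False.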

3. Generate Twilio-degraded pairs

python scripts/simulate_twilio.py data/pairs/clean/ data/pairs/degraded/

This simulates the Twilio degradation chain on each clean file: WAV 48kHz → G.711 µ-law 8kHz → MP3 22kHz 32kbps → WAV 48kHz
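As a rough illustration of that chain, the sketch below builds three ffmpeg command lines, one per stage. It is an assumption about how such a chain could be expressed, not the contents of scripts/simulate_twilio.py: the helper name, the intermediate file names, and the reading of "22kHz" as 22050 Hz are all hypothetical, and the MP3 stage assumes an ffmpeg build that includes libmp3lame.

```python
from pathlib import Path


def twilio_chain_commands(clean: Path, out: Path, workdir: Path) -> list[list[str]]:
    """Build ffmpeg commands for: 48 kHz WAV -> µ-law 8 kHz -> MP3 32 kbps -> 48 kHz WAV."""
    ulaw = workdir / (clean.stem + ".ulaw.wav")  # hypothetical intermediate file
    mp3 = workdir / (clean.stem + ".mp3")        # hypothetical intermediate file
    return [
        # Stage 1: downsample to 8 kHz mono and encode as G.711 µ-law
        ["ffmpeg", "-y", "-i", str(clean), "-ar", "8000", "-ac", "1",
         "-c:a", "pcm_mulaw", str(ulaw)],
        # Stage 2: re-encode to MP3 at 22.05 kHz, 32 kbps (assumed values)
        ["ffmpeg", "-y", "-i", str(ulaw), "-ar", "22050", "-ac", "1",
         "-c:a", "libmp3lame", "-b:a", "32k", str(mp3)],
        # Stage 3: decode back to 48 kHz 16-bit PCM WAV
        ["ffmpeg", "-y", "-i", str(mp3), "-ar", "48000", "-ac", "1",
         "-c:a", "pcm_s16le", str(out)],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)` in order. The real script may apply extra filtering that this sketch omits.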

Simulation fidelity was validated against 5 real Twilio recordings: waveform correlation 0.997-0.999, spectral differences <0.7dB.
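The README does not say how the waveform correlation was computed. One common choice is the Pearson correlation between two time-aligned, equal-length sample sequences, sketched below as an assumption. The function name is hypothetical, and the inputs are expected to be aligned beforehand.

```python
import math
from typing import Sequence


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, time-aligned sample sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)
```

A value close to 1.0 means the simulated waveform tracks the real recording closely, up to gain and offset.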

4. Fix WAV format (important!)

The WAV files from ffmpeg may use WAVE_FORMAT_EXTENSIBLE (format 65534) which ClearerVoice cannot read. Convert to standard PCM:

# Run this on the machine where training will happen
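python -c "
import subprocess
from pathlib import Path

for d in ['clean', 'degraded']:
    dir_path = Path('data/pairs') / d
    # Skip leftover temp files from an interrupted earlier run
    files = sorted(p for p in dir_path.glob('*.wav') if not p.name.endswith('.tmp.wav'))
    fixed, failed = 0, []
    for f in files:
        tmp = f.with_suffix('.tmp.wav')
        # Re-encode to plain 16-bit PCM, 48 kHz, mono
        result = subprocess.run(
            ['ffmpeg', '-i', str(f), '-acodec', 'pcm_s16le', '-ar', '48000', '-ac', '1', '-y', str(tmp)],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            tmp.replace(f)  # overwrite the original in place
            fixed += 1
            if fixed % 500 == 0:
                print(f'  {d}: fixed {fixed}/{len(files)}')
        else:
            tmp.unlink(missing_ok=True)  # drop the partial output
            failed.append(f.name)
    print(f'{d}: fixed {fixed} files total, {len(failed)} failed')
    for name in failed:
        print(f'  FAILED: {name}')
"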

Training on RunPod (step by step)

1. Create a RunPod pod

  • GPU: A100 SXM 80GB (or A40 48GB for cheaper runs)
  • Template: RunPod PyTorch 2.4.0 (py3.11, CUDA 12.4)
  • Storage: Container Disk 50GB, Volume Disk 50GB
  • Enable SSH terminal access

2. SSH into the pod

ssh root@<POD_IP> -p <PORT> -i ~/.ssh/id_ed25519

3. Clone repos

cd /workspace
git clone https://github.com/aker-dev/sos-bwe-experiment.git
git clone https://github.com/modelscope/ClearerVoice-Studio.git

Note: if the repo is private, either make it temporarily public or use a GitHub Personal Access Token:

git clone https://<TOKEN>@github.com/aker-dev/sos-bwe-experiment.git

4. Install dependencies

pip install -r ClearerVoice-Studio/requirements.txt
pip install huggingface_hub soundfile librosa
pip install "numpy<2" --force-reinstall
pip install pesq --force-reinstall --no-binary pesq
apt-get update && apt-get install -y ffmpeg rsync

5. Download pretrained checkpoint

python -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='alibabasglab/MossFormer2_SR_48K',
    local_dir='ClearerVoice-Studio/train/speech_super_resolution/checkpoints/MossFormer2_SR_48K'
)
print('Done')
"

6. Upload dataset from local machine

From your local machine (not the pod):

cd ~/Documents/GitHub/sos-bwe-experiment

# Create dirs on pod first (via another SSH session):
# mkdir -p /workspace/sos-bwe-experiment/data/pairs/clean
# mkdir -p /workspace/sos-bwe-experiment/data/pairs/degraded

rsync -avz --no-owner --no-group --progress \
    data/pairs/ \
    root@<POD_IP>:/workspace/sos-bwe-experiment/data/pairs/ \
    -e "ssh -p <PORT>"
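Because the patched dataloader in step 8 falls back to random downsampling whenever a degraded file is missing, it may be worth checking after the upload that every clean file has a counterpart. The sketch below is one way to do that. The function name is hypothetical and it only compares file names, not contents.

```python
from pathlib import Path


def unmatched_pairs(pairs_root: Path) -> tuple[list[str], list[str]]:
    """Return (clean files without a degraded twin, degraded files without a clean twin)."""
    clean = {p.name for p in (pairs_root / "clean").glob("*.wav")}
    degraded = {p.name for p in (pairs_root / "degraded").glob("*.wav")}
    return sorted(clean - degraded), sorted(degraded - clean)


if __name__ == "__main__":
    missing_degraded, missing_clean = unmatched_pairs(
        Path("/workspace/sos-bwe-experiment/data/pairs")
    )
    print(f"clean without degraded: {len(missing_degraded)}")
    print(f"degraded without clean: {len(missing_clean)}")
```

Both counts should be zero before training starts.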

7. Fix WAV format on the pod

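python -c "
import subprocess
from pathlib import Path

for d in ['clean', 'degraded']:
    dir_path = Path(f'/workspace/sos-bwe-experiment/data/pairs/{d}')
    # Skip leftover temp files from an interrupted earlier run
    files = sorted(p for p in dir_path.glob('*.wav') if not p.name.endswith('.tmp.wav'))
    fixed, failed = 0, []
    for f in files:
        tmp = f.with_suffix('.tmp.wav')
        # Re-encode to plain 16-bit PCM, 48 kHz, mono
        result = subprocess.run(
            ['ffmpeg', '-i', str(f), '-acodec', 'pcm_s16le', '-ar', '48000', '-ac', '1', '-y', str(tmp)],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            tmp.replace(f)  # overwrite the original in place
            fixed += 1
            if fixed % 500 == 0:
                print(f'  {d}: fixed {fixed}/{len(files)}')
        else:
            tmp.unlink(missing_ok=True)  # drop the partial output
            failed.append(f.name)
    print(f'{d}: fixed {fixed} files total, {len(failed)} failed')
    for name in failed:
        print(f'  FAILED: {name}')
"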

8. Patch the dataloader

cd /workspace/ClearerVoice-Studio/train/speech_super_resolution

# Apply the patch manually (the .patch file may not apply cleanly due to whitespace)
python -c "
content = open('dataloader/dataloader.py').read()

# Find the exact old lines (check with: sed -n '550,556p' dataloader/dataloader.py)
# The old code does random downsampling:
#   sr_out = random.choice(args.supported_sampling_rates)
#   audio_down = resample(audio, args.sampling_rate, sr_out)
#   target_len = len(audio)
#   audio_input = resample(audio_down, None, None, target_len)

# We need to match the exact whitespace. Print lines around sr_out to verify:
lines = content.split('\n')
for i, l in enumerate(lines):
    if 'sr_out = random' in l:
        for j in range(i, min(i+5, len(lines))):
            print(f'Line {j+1}: {repr(lines[j])}')
        break
"

# Then use the exact matched string to patch (adapt based on output above)
# See patches/dataloader.patch for the intended change

The patch replaces the random downsampling with pre-degraded Twilio pair loading:

# BEFORE (original):
sr_out = random.choice(args.supported_sampling_rates)
audio_down = resample(audio, args.sampling_rate, sr_out)
target_len = len(audio)
audio_input = resample(audio_down, None, None, target_len)

# AFTER (patched):
degraded_dir = os.path.dirname(filename).replace('/clean', '/degraded')
basename = os.path.basename(filename)
degraded_path = os.path.join(degraded_dir, basename)
target_len = len(audio)
audio_input = None
if os.path.exists(degraded_path):
    audio_input, _ = load_segment(degraded_path, args.sampling_rate, args.segment_size)
    if audio_input is not None and len(audio_input) != target_len:
        audio_input = resample(audio_input, None, None, target_len)
if audio_input is None:
    sr_out = random.choice(args.supported_sampling_rates)
    audio_down = resample(audio, args.sampling_rate, sr_out)
    audio_input = resample(audio_down, None, None, target_len)

9. Create training config and SCP files

# Training config
cat > config/train/twilio_finetune.yaml << 'EOF'
mode: 'train'
network: "MossFormer2_SR_48K"
config_json: "config/train/MossFormer2_SR_48K.json"
checkpoint_dir: "checkpoints/MossFormer2_SR_48K"
tr_list: 'data/train.scp'
cv_list: 'data/cv.scp'
tt_list: 'None'
batch_size: 8
max_epoch: 50
weight_decay: 0.00001
clip_grad_norm: 10.
seed: 777
accu_grad: 1
effec_batch_size: 16
max_length: 4
EOF

# SCP files (90/10 train/val split)
mkdir -p data
python -c "
from pathlib import Path
clean_dir = Path('/workspace/sos-bwe-experiment/data/pairs/clean')
files = sorted(clean_dir.glob('*.wav'))
split = int(len(files) * 0.9)
train = [str(f) for f in files[:split]]
val = [str(f) for f in files[split:]]
Path('data/train.scp').write_text('\n'.join(train) + '\n')
Path('data/cv.scp').write_text('\n'.join(val) + '\n')
print(f'Train: {len(train)}, Val: {len(val)}')
"

10. Launch training

CUDA_VISIBLE_DEVICES=0 python -W ignore \
    train.py \
    --config config/train/twilio_finetune.yaml \
    --checkpoint_dir checkpoints/MossFormer2_SR_48K \
    --train_from_last_checkpoint 1 \
    --print_freq 10 \
    --checkpoint_save_freq 500

Expected output:

  • A100 80GB: ~0.007s/batch, ~2.4s/epoch, ~2 min for 50 epochs
  • A40 48GB: slightly slower, still under 10 min total
  • Loss should start at Gen_Loss ~100, Disc_Loss ~8 and decrease
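The A100 figures are consistent with each other, as the short calculation below shows. It assumes the 90/10 split from step 9 (2700 training clips out of 3000), one segment per clip per epoch, and that the last partial batch is kept. These are assumptions, not facts taken from the training code.

```python
import math

train_clips = int(3000 * 0.9)   # 90% of 3000 pairs
batch_size = 8                  # from twilio_finetune.yaml
seconds_per_batch = 0.007       # A100 figure quoted above
epochs = 50                     # max_epoch in the config

batches_per_epoch = math.ceil(train_clips / batch_size)
seconds_per_epoch = batches_per_epoch * seconds_per_batch
total_minutes = seconds_per_epoch * epochs / 60

print(f"{batches_per_epoch} batches/epoch")      # 338
print(f"{seconds_per_epoch:.2f} s/epoch")        # about 2.4 s
print(f"{total_minutes:.2f} min for {epochs} epochs")  # about 2 min
```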

11. Run inference on test files

# Copy real Twilio test files to the pod, then:
cat > config/inference/twilio_finetune.yaml << 'EOF'
mode: 'inference'
config_json: "config/inference/MossFormer2_SR_48K.json"
use_cuda: 1
num_gpu: 1
sampling_rate: 48000
network: "MossFormer2_SR_48K"
checkpoint_dir: "checkpoints/MossFormer2_SR_48K"
input_path: "data/test_twilio.scp"
output_dir: "outputs/twilio_finetuned"
one_time_decode_length: 20
decode_window: 4
EOF

# Create test SCP with paths to real Twilio files
find /workspace/sos-bwe-experiment/data/real_twilio/ -name "*.wav" > data/test_twilio.scp

python inference.py --config config/inference/twilio_finetune.yaml

12. Download results

From your local machine:

rsync -avz --no-owner --no-group \
    root@<POD_IP>:/workspace/ClearerVoice-Studio/train/speech_super_resolution/outputs/ \
    ~/Documents/GitHub/sos-bwe-experiment/outputs/ \
    -e "ssh -p <PORT>"

# Also download the fine-tuned checkpoint
rsync -avz --no-owner --no-group \
    root@<POD_IP>:/workspace/ClearerVoice-Studio/train/speech_super_resolution/checkpoints/MossFormer2_SR_48K/ \
    ~/Documents/GitHub/sos-bwe-experiment/checkpoints/MossFormer2_SR_48K/ \
    -e "ssh -p <PORT>"

13. Stop the pod

Don't forget to stop/terminate the pod when done to avoid charges.

Troubleshooting

"unknown format: 65534" errors

The WAV files use a WAVE_FORMAT_EXTENSIBLE header (format tag 65534). Fix with step 7 above, which re-encodes them to pcm_s16le with ffmpeg.
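To find out which files are affected before re-encoding everything, the format tag can be read directly from the RIFF header. The sketch below walks the chunk list until it reaches the `fmt ` chunk and returns its format tag. The function names are hypothetical, and it assumes the `fmt ` chunk lies within the first 4096 bytes of the file.

```python
import struct
from pathlib import Path

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # 65534


def wav_format_tag(data: bytes) -> int:
    """Return the wFormatTag field from the 'fmt ' chunk of a RIFF/WAVE header."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt " and pos + 10 <= len(data):
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return tag
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no fmt chunk found")


def needs_conversion(path: Path) -> bool:
    """True when the file carries the WAVE_FORMAT_EXTENSIBLE tag."""
    with open(path, "rb") as fh:
        head = fh.read(4096)
    return wav_format_tag(head) == WAVE_FORMAT_EXTENSIBLE
```

Running `needs_conversion` over data/pairs/clean and data/pairs/degraded gives a quick count of the files that still need the ffmpeg pass.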

NumPy version conflict

pip install "numpy<2" --force-reinstall
pip install pesq --force-reinstall --no-binary pesq

Dataloader patch doesn't apply

The .patch file may not match due to trailing whitespace or blank lines. Use the Python string replacement approach — print the exact lines around sr_out = random and match them precisely.
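A minimal sketch of that string-replacement approach is shown below. The helper name is hypothetical. It refuses to write anything unless the old block matches exactly once, so a whitespace mismatch shows up as an error instead of a silently unpatched file.

```python
def replace_exactly_once(content: str, old: str, new: str) -> str:
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    count = content.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return content.replace(old, new, 1)


# Usage sketch: old_block and new_block must be copied with their exact indentation,
# for example from the repr() output printed in step 8.
# path = 'dataloader/dataloader.py'
# patched = replace_exactly_once(open(path).read(), old_block, new_block)
# open(path, 'w').write(patched)
```

Keeping a backup copy of dataloader.py before writing the patched content makes it easy to retry.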

Training on CPU (local testing only)

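Patch train.py to force CPU; apply the same substitutions to inference.py if you also run inference locally. The commands use GNU sed syntax. On macOS, BSD sed requires an explicit backup suffix, so write sed -i '' instead of sed -i: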

sed -i "s/device = torch.device('cuda') if args.use_cuda else torch.device('cpu')/device = torch.device('cpu')/" train.py
sed -i 's/torch.cuda.empty_cache()/#torch.cuda.empty_cache()/' train.py
sed -i 's/torch.cuda.set_device(args.local_rank)/#torch.cuda.set_device(args.local_rank)/' train.py

Cost estimate

| GPU | Price/hr | Estimated training time (50 epochs, 3000 pairs) | Total cost |
|---|---|---|---|
| A100 SXM 80GB | $1.49/hr (RunPod) | ~15 min | ~$0.40 |
| A40 48GB | $0.40/hr (RunPod) | ~30 min | ~$0.20 |
| MacBook Air M2 (CPU) | Free | ~12 hours | $0 |
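The totals in the table are the hourly price multiplied by the run time, rounded. The short sketch below reproduces that arithmetic with a hypothetical helper. It ignores storage and idle-time charges, which the table does not list.

```python
def run_cost(price_per_hour: float, minutes: float) -> float:
    """Cost in dollars of a run billed by the hour, for a duration in minutes."""
    return price_per_hour * minutes / 60


a100_cost = run_cost(1.49, 15)  # 0.3725, listed as ~$0.40
a40_cost = run_cost(0.40, 30)   # 0.20, listed as ~$0.20

print(f"A100 SXM 80GB: ${a100_cost:.2f}")
print(f"A40 48GB:      ${a40_cost:.2f}")
```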
