Add duration-safe MKA conversion + shard_from_audio improvements by tlebryk · Pull Request #47 · HumeAI/wsds

tlebryk · 2026-03-13T15:17:55Z

Summary

Improvements to shard_from_audio_dir and related indexing utilities:

shard_from_audio_dir enhancements: configurable file sorting, key mapping output, key prefix support, .mka extension support
extract_index_for_shard fixes: add type hints, support require_audio_duration=False for formats where duration isn't in metadata, flexible shard tuple/string input
init passthrough: propagate require_audio_duration flag
Cleanup: remove dead webdataset/tarfile imports, dedup code, restore _list batch validation
init_index fix: write shards to audio/ subdirectory so list_all_shards() can find them
Declare tqdm dependency: already used throughout ws_tools.py but was missing from requirements.txt
Test suite: comprehensive tests for shard_from_audio_dir (key mapping, prefix, sorting, flush, large file skip)
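To make the key mapping, key prefix, and configurable sorting items above concrete, here is a minimal sketch of how such a mapping could be built. It is not the code from this PR: build_key_mapping, its parameters, and the AUDIO_EXTENSIONS set are hypothetical names chosen for illustration, and only the .mka extension is confirmed by the summary.

```python
from pathlib import Path

# Assumed extension set for illustration; the PR only confirms that .mka was added.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".mka"}


def build_key_mapping(paths, key_prefix="", sort_key=None):
    """Map shard keys to source files in a deterministic order (hypothetical helper)."""
    # Keep only files whose extension is in the supported set.
    audio = [Path(p) for p in paths if Path(p).suffix.lower() in AUDIO_EXTENSIONS]
    # Default to sorting by POSIX-style path; callers can pass their own sort key.
    ordered = sorted(audio, key=sort_key or (lambda p: p.as_posix()))
    # Key is the prefix plus the file stem; value is the original path.
    return {f"{key_prefix}{p.stem}": p.as_posix() for p in ordered}


mapping = build_key_mapping(
    ["b/call2.mka", "a/call1.wav", "notes.txt"], key_prefix="webrtc_"
)
print(mapping)
```

Writing the resulting dictionary next to the shards would let a consumer trace each sample key back to its source file. This sketch does not handle stem collisions across directories.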

Test plan

New test suite for shard_from_audio_dir
Existing tests still pass

Support ingesting raw audio directories into WSDS shards via the new shard_from_audio_dir command. Update extract_index_for_shard to infer audio duration from metadata instead of requiring pre-computed fields, and broaden the torchcodec fallback to catch all exceptions (not just ImportError).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
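The broadened fallback described in this commit message, together with the require_audio_duration flag from the summary, can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the actual extract_index_for_shard code: probe_duration and the two example probes are hypothetical, and the real decoders are replaced by plain callables so the sketch runs without torchcodec installed.

```python
def probe_duration(path, probes, require_audio_duration=True):
    """Return the first duration (in seconds) that any probe yields (hypothetical helper)."""
    for probe in probes:
        try:
            return probe(path)
        except Exception:
            # Deliberately broad: a decoder can fail at import time (ImportError)
            # or at decode time (any other error), and both should fall through.
            continue
    if require_audio_duration:
        raise ValueError(f"no duration available for {path}")
    # Formats without duration in their metadata are indexed without one.
    return None


def broken_probe(path):
    # Stands in for a primary decoder that fails on this file.
    raise RuntimeError("decoder could not open file")


def header_probe(path):
    # Stands in for a fallback that reads the duration from container metadata.
    return 12.5


print(probe_duration("a.wav", [broken_probe, header_probe]))
print(probe_duration("a.mka", [broken_probe], require_audio_duration=False))
```

Catching Exception rather than ImportError trades some debuggability for robustness, so real code would likely want to log the swallowed error.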

…hard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>

…safe conversion

MKA files from WebRTC recordings contain timestamp gaps that cause ffmpeg to silently drop audio, producing shorter output files. The aresample=async=1 filter fills gaps with silence to preserve the original duration. Also adds .mka to shard_from_audio_dir supported extensions.
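Because a later commit in this PR moves MKA conversion out to consumer code, a consumer might apply the same filter along these lines. This is a sketch, assuming ffmpeg is on PATH; build_ffmpeg_cmd and convert are hypothetical names, not functions from this repository.

```python
import subprocess


def build_ffmpeg_cmd(src, dst):
    """Build an ffmpeg command that pads timestamp gaps with silence."""
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-i", src,                  # input file, e.g. a WebRTC .mka recording
        "-af", "aresample=async=1", # resample so audio stays aligned to timestamps
        dst,
    ]


def convert(src, dst):
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)


print(build_ffmpeg_cmd("call.mka", "call.wav"))
```

Comparing the input and output durations after conversion is a cheap way to confirm that no audio was dropped.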

… belongs in consumer code

Theo Lebryk and others added 11 commits March 16, 2026 12:16

Clean up shard_from_audio_dir: remove dead param, unused var, dedup

0c06442

Add test suite for shard_from_audio_dir

d904064

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Restore _list batch validation, add type hints to extract_index_for_s…

bd7ac17

…hard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sorting is configurable

4a9d25a

Initial plan

d49e1ad

Fix init_index to write shards to audio/ subdirectory

b7c7954

Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>

Delete .gitignore

d6eb097

remove test that depends on ipython

04e4717

Remove convert_mka_to_audio — MKA conversion with start_time handling…

779ac08

… belongs in consumer code

tlebryk force-pushed the feat/shard-from-audio-tests branch from 2929825 to 779ac08 Compare March 16, 2026 16:18

tlebryk requested a review from jpc March 16, 2026 16:34

tlebryk wants to merge 11 commits into main from feat/shard-from-audio-tests


2 participants
