Add duration-safe MKA conversion + shard_from_audio improvements#47

Open
tlebryk wants to merge 11 commits into main from feat/shard-from-audio-tests

Conversation

tlebryk (Contributor) commented Mar 13, 2026

Summary

Improvements to shard_from_audio_dir and related indexing utilities:

  • shard_from_audio_dir enhancements: configurable file sorting, key mapping output, key prefix support, .mka extension support
  • extract_index_for_shard fixes: add type hints, support require_audio_duration=False for formats where duration isn't in metadata, flexible shard tuple/string input
  • init passthrough: propagate require_audio_duration flag
  • Cleanup: remove dead webdataset/tarfile imports, dedup code, restore _list batch validation
  • init_index fix: write shards to audio/ subdirectory so list_all_shards() can find them
  • Declare tqdm dependency: already used throughout ws_tools.py but was missing from requirements.txt
  • Test suite: comprehensive tests for shard_from_audio_dir (key mapping, prefix, sorting, flush, large file skip)
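As a rough illustration of how sorting, key prefixes, and key mapping output could fit together, here is a minimal sketch. It is not the code from this PR: `build_key_mapping`, its parameters, and the `AUDIO_EXTENSIONS` set are hypothetical names chosen for the example, and only the `.mka` extension is confirmed by the description above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

# Assumed extension set for illustration; only ".mka" is confirmed by this PR.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".mka"}


def build_key_mapping(
    paths: Iterable[str],
    key_prefix: str = "",
    sort_key: Optional[Callable[[Path], object]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: map sample keys to their source audio files.

    Files are filtered by extension, sorted (by file name unless a custom
    sort_key is given), and keyed as key_prefix + file stem.
    """
    audio = [Path(p) for p in paths if Path(p).suffix.lower() in AUDIO_EXTENSIONS]
    audio.sort(key=sort_key or (lambda p: p.name))
    # Dicts preserve insertion order, so the mapping reflects the sort order.
    return {f"{key_prefix}{p.stem}": str(p) for p in audio}


mapping = build_key_mapping(["b.mka", "a.wav", "notes.txt"], key_prefix="call_")
```

A deterministic sort order makes shard contents reproducible across runs, and writing the mapping out lets a sample key be traced back to its source file.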

Test plan

  • New test suite for shard_from_audio_dir
  • Existing tests still pass

Theo Lebryk and others added 11 commits March 16, 2026 12:16
Support ingesting raw audio directories into WSDS shards via the new
shard_from_audio_dir command. Update extract_index_for_shard to infer
audio duration from metadata instead of requiring pre-computed fields,
and broaden the torchcodec fallback to catch all exceptions (not just
ImportError).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
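The broadened fallback described in this commit message can be pictured with a small sketch. This is an assumption-laden illustration, not the PR's implementation: `duration_with_fallback` is a hypothetical name, the two readers are passed in as plain callables rather than calling torchcodec directly, and the handling of `require_audio_duration` is a guess at the flag's semantics.

```python
from typing import Callable, Optional

DurationReader = Callable[[str], Optional[float]]


def duration_with_fallback(
    path: str,
    primary: DurationReader,
    fallback: DurationReader,
    require_audio_duration: bool = True,
) -> Optional[float]:
    """Hypothetical helper: try the primary reader, fall back on any failure."""
    try:
        duration = primary(path)
    except Exception:
        # Deliberately broad: a decoder can fail at runtime (not only at
        # import time), so catching ImportError alone would miss those cases.
        duration = fallback(path)
    if duration is None and require_audio_duration:
        raise ValueError(f"no duration available for {path}")
    return duration
```

With `require_audio_duration=False`, a file whose duration cannot be determined yields `None` instead of raising, which matches the intent stated in the summary for formats that carry no duration in their metadata.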
…hard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>
…safe conversion

MKA files from WebRTC recordings contain timestamp gaps that cause
ffmpeg to silently drop audio, producing shorter output files.
The aresample=async=1 filter fills gaps with silence to preserve
the original duration. Also adds .mka to shard_from_audio_dir
supported extensions.
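For reference, a conversion using this filter could be invoked roughly as follows. This is a sketch under stated assumptions: `build_mka_convert_cmd` and `convert_mka` are hypothetical names, the `-y` overwrite flag and the output format are choices made for the example, and only the `aresample=async=1` filter comes from the commit message.

```python
import subprocess
from typing import List


def build_mka_convert_cmd(src: str, dst: str) -> List[str]:
    """Hypothetical helper: build an ffmpeg command that keeps the duration.

    The aresample=async=1 audio filter pads timestamp gaps with silence
    instead of letting the gaps shorten the output.
    """
    return [
        "ffmpeg",
        "-y",  # assumption: overwrite an existing output file
        "-i", src,
        "-af", "aresample=async=1",
        dst,
    ]


def convert_mka(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mka_convert_cmd(src, dst), check=True)
```

Comparing the input and output durations after conversion is a simple way to check that no audio was dropped.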
@tlebryk tlebryk force-pushed the feat/shard-from-audio-tests branch from 2929825 to 779ac08 Compare March 16, 2026 16:18
@tlebryk tlebryk requested a review from jpc March 16, 2026 16:34
2 participants