Add duration-safe MKA conversion + shard_from_audio improvements#47
Open
Support ingesting raw audio directories into WSDS shards via the new shard_from_audio_dir command. Update extract_index_for_shard to infer audio duration from metadata instead of requiring pre-computed fields, and broaden the torchcodec fallback to catch all exceptions (not just ImportError).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>
…safe conversion
MKA files from WebRTC recordings contain timestamp gaps that cause ffmpeg to silently drop audio, producing shorter output files. The aresample=async=1 filter fills gaps with silence to preserve the original duration. Also adds .mka to shard_from_audio_dir supported extensions.
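A minimal sketch of what such a conversion call might look like follows. Only the `aresample=async=1` filter comes from the commit message; the helper names `build_mka_convert_cmd` and `durations_match`, the output path, and the tolerance value are assumptions made for illustration.

```python
from typing import List


def build_mka_convert_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg argv that resamples with async=1.

    The aresample=async=1 filter pads timestamp gaps with silence, so the
    output keeps the original duration instead of silently shrinking.
    """
    return [
        "ffmpeg",
        "-y",                       # overwrite dst if it exists
        "-i", src,                  # input file, e.g. a WebRTC .mka recording
        "-af", "aresample=async=1",
        dst,
    ]


def durations_match(src_seconds: float, dst_seconds: float, tol: float = 0.1) -> bool:
    """Check that conversion preserved duration within `tol` seconds (assumed tolerance)."""
    return abs(src_seconds - dst_seconds) <= tol


cmd = build_mka_convert_cmd("recording.mka", "recording.wav")
# To actually run it: subprocess.run(cmd, check=True)
```

Comparing the source and output durations after conversion is one way to detect the silent truncation the commit message describes.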
… belongs in consumer code
2929825 to 779ac08
Summary
Improvements to `shard_from_audio_dir` and related indexing utilities:

- `shard_from_audio_dir` enhancements: configurable file sorting, key mapping output, key prefix support, `.mka` extension support
- `extract_index_for_shard` fixes: add type hints, support `require_audio_duration=False` for formats where duration isn't in metadata, flexible shard tuple/string input
- `init` passthrough: propagate `require_audio_duration` flag
- `webdataset`/`tarfile` imports, dedup code, restore `_list` batch validation
- `init_index` fix: write shards to `audio/` subdirectory so `list_all_shards()` can find them
- `tqdm` dependency: already used throughout `ws_tools.py` but was missing from `requirements.txt`
- `shard_from_audio_dir` (key mapping, prefix, sorting, flush, large file skip)

Test plan
shard_from_audio_dir
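
To make the key mapping, prefix, and sorting options from the summary more concrete, here is a small sketch of how sample keys could be derived from file names. The function `build_key_mapping`, its parameters, and the extension set are hypothetical; only `.mka` is named in this PR, and the actual `shard_from_audio_dir` behavior may differ.

```python
from pathlib import Path
from typing import Dict, Iterable

# Assumed extension set for illustration; the PR only confirms that .mka was added.
AUDIO_EXTS = {".wav", ".flac", ".mp3", ".mka"}


def build_key_mapping(
    paths: Iterable[str], key_prefix: str = "", sort: bool = True
) -> Dict[str, str]:
    """Map shard sample keys to the original audio file paths.

    Non-audio files are skipped, files are optionally sorted for a
    deterministic order, and each key is the prefix plus the file stem.
    """
    files = [Path(p) for p in paths if Path(p).suffix.lower() in AUDIO_EXTS]
    if sort:
        files.sort(key=str)
    mapping: Dict[str, str] = {}
    for f in files:
        key = f"{key_prefix}{f.stem}"
        if key in mapping:
            # Two files with the same stem would collide inside a shard.
            raise ValueError(f"duplicate key: {key}")
        mapping[key] = str(f)
    return mapping


mapping = build_key_mapping(["b.mka", "a.wav", "notes.txt"], key_prefix="rec_")
```

Writing this mapping alongside the shards lets consumers trace a sample key back to its source file.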