Make file sorting optional in shard_from_audio_dir for large corpora#42

Closed
Copilot wants to merge 4 commits into feat/shard-from-audio from copilot/sub-pr-40

Copilot AI commented Feb 10, 2026

Addresses performance feedback from #40: sorted(input_dir.rglob(...)) materializes and sorts all file paths up-front, causing unnecessary memory/time overhead for large audio datasets.

Changes

  • Add sort_files parameter (default: False) to shard_from_audio_dir
    • When False: iterates lazily over files without sorting (memory efficient)
    • When True: materializes full file list and sorts (deterministic ordering)
  • Update test coverage: add test_sort_files_parameter to verify both behaviors

Example

# Default: efficient for large corpora; order follows filesystem traversal and is not guaranteed
shard_from_audio_dir("./audio", "./shards")

# Opt-in to deterministic ordering when needed
shard_from_audio_dir("./audio", "./shards", sort_files=True)

The generator-based approach eliminates the memory spike from loading all paths before processing begins.
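
As a rough illustration of the two modes described above, here is a minimal sketch of how the file iteration might be switched on sort_files. The helper name iter_audio_files and the "*.wav" pattern are hypothetical and are not taken from this PR; only pathlib's rglob and the built-in sorted are real calls.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_audio_files(
    input_dir: Union[str, Path],
    pattern: str = "*.wav",      # assumed pattern, for illustration only
    sort_files: bool = False,
) -> Iterator[Path]:
    """Hypothetical helper: yield audio paths lazily, or sorted on request."""
    # rglob returns an iterator, so no paths are collected at this point.
    paths = Path(input_dir).rglob(pattern)
    if sort_files:
        # sorted() consumes the whole iterator into a list first:
        # deterministic order, at the cost of holding every path in memory.
        return iter(sorted(paths))
    # Lazy mode: paths are produced one at a time in traversal order.
    return paths
```

With sort_files=False the caller can start writing shards as soon as the first path is found; with sort_files=True nothing is yielded until the full directory walk and sort have finished.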


Copilot AI and others added 3 commits February 10, 2026 23:55
…orpora

Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>
Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>
Co-authored-by: tlebryk <43556997+tlebryk@users.noreply.github.com>
Copilot AI changed the title from "[WIP] WIP on addressing feedback for feat/shard from audio" to "Make file sorting optional in shard_from_audio_dir for large corpora" Feb 10, 2026
Copilot AI requested a review from tlebryk February 10, 2026 23:57
@tlebryk tlebryk closed this Feb 11, 2026

2 participants