Add unified data gathering system with plugin architecture #27
Merged
…EFERENCE at root, no extra docs, no agent_evaluator) Co-authored-by: Cursor <cursoragent@cursor.com>
- ✅ Created Dockerfile.training (GPU) and Dockerfile.training.cpu (CPU)
- ✅ Verified all dependencies: LoRA/PEFT, Wav2Vec2, PyTorch, audio processing
- ✅ Added Docker Compose configuration for training
- ✅ Created verification and build scripts
- ✅ Comprehensive documentation (quickstart, troubleshooting, completion report)
- ✅ CPU container built and verified successfully (3.05GB)
- ✅ GPU container ready for GCP deployment
All Week 1 tasks for Kavya completed. Training environment is reproducible and ready for GCP deployment.
Co-authored-by: Cursor <cursoragent@cursor.com>
…Docker docs
- Move run_comprehensive_evaluations.py to experiments/
- Move verify_evaluation_numbers.py to experiments/
- Move EVALUATION_VERIFICATION_SUMMARY.md to docs/ and expand explanation
- Delete redundant WEEK1_DOCKER_BUILD_SUMMARY.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Delete redundant WEEK1_TRAINING_DOCKER_QUICKSTART.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Add Quick Start section to docs/WEEK1_TRAINING_DOCKER.md
- Expand verification summary purpose and context explanation
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…luating what we had in main
Pull request overview
This PR replaces the legacy dataset download/preprocessing scripts with a unified, registry-driven data gathering pipeline under scripts/data_gatherer/, using a plugin architecture for Hugging Face, OpenSLR, and Git sources and emitting standardized CSV manifests.
Changes:
- Added a unified CLI orchestrator (scripts/data_gatherer/data_gather.py) plus a convenience wrapper (scripts/gather_data.py) and a YAML dataset registry.
- Added source plugins (Hugging Face / OpenSLR / Git-LFS) and dataset-specific manifest generators.
- Added augmentation utility (scripts/augment_audio.py) and updated setup/docs/dependencies; removed legacy download/preprocess scripts.
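For readers unfamiliar with the "standardized CSV manifest" idea mentioned in the overview, here is a minimal sketch of what writing one could look like. The column names (`audio_path`, `duration_sec`, `dataset`) and the `write_manifest` helper are assumptions made for illustration; the PR's actual schema lives in the manifest generators and is not reproduced here.

```python
import csv
import io

# Hypothetical column set; the PR's real manifest schema is not shown on this page.
MANIFEST_COLUMNS = ["audio_path", "duration_sec", "dataset"]


def write_manifest(rows, stream):
    """Write rows (dicts keyed by MANIFEST_COLUMNS) as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=MANIFEST_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Write one illustrative row to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_manifest(
    [{"audio_path": "librispeech/a.flac", "duration_sec": 3.2, "dataset": "librispeech"}],
    buffer,
)
manifest_text = buffer.getvalue()
```

Keeping one fixed header across all datasets is what lets downstream training code read every manifest the same way, regardless of which source plugin produced it.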
Reviewed changes
Copilot reviewed 24 out of 26 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| scripts/preprocess_data.py | Removed legacy preprocessing pipeline script. |
| scripts/download_datasets.py | Removed legacy dataset download/curation script. |
| scripts/gather_data.py | Added wrapper CLI that forwards to the unified data gatherer. |
| scripts/data_gatherer/data_gather.py | Added main orchestrator CLI for registry-driven downloads + manifest generation. |
| scripts/data_gatherer/dataset_registry.yaml | Added central registry defining datasets and their source configs. |
| scripts/data_gatherer/dataset_utils.py | Added shared utilities for manifests, audio metadata, naming, logging, etc. |
| scripts/data_gatherer/source_plugins/__init__.py | Added DataSourcePlugin interface for plugin architecture. |
| scripts/data_gatherer/source_plugins/huggingface_plugin.py | Added Hugging Face download + manifest generation implementation. |
| scripts/data_gatherer/source_plugins/openslr_plugin.py | Added OpenSLR download/extract + manifest generation routing implementation. |
| scripts/data_gatherer/source_plugins/git_plugin.py | Added Git/Git-LFS clone + manifest generation routing implementation. |
| scripts/data_gatherer/manifest_generators/__init__.py | Added manifest generator package export list. |
| scripts/data_gatherer/manifest_generators/librispeech.py | Added LibriSpeech manifest generator. |
| scripts/data_gatherer/manifest_generators/musan.py | Added MUSAN manifest generator. |
| scripts/data_gatherer/manifest_generators/rirs.py | Added RIRS_NOISES manifest generator. |
| scripts/data_gatherer/manifest_generators/st_aeds.py | Added ST-AEDS manifest generator. |
| scripts/data_gatherer/manifest_generators/primock57.py | Added PriMock57 manifest generator. |
| scripts/data_gatherer/__init__.py | Added package marker + version. |
| scripts/data_gatherer/README.md | Added detailed technical documentation for the new system. |
| scripts/data_gather.md | Added main user-facing data gathering guide. |
| scripts/augment_audio.py | Added standalone audio augmentation utility (MUSAN/RIRS + corruptions). |
| requirements.txt | Updated dependencies for the new pipeline + VoxPopuli decoding. |
| environment.yml | Added conda environment spec including git-lfs/ffmpeg and Python deps. |
| SETUP.md | Added consolidated environment setup guide (conda/docker/manual). |
| README.md | Updated repo docs to reference new pipeline and setup docs. |
| Dockerfile | Added git-lfs installation/init to container image. |
| .gitignore | Updated ignores for new data directories, manifests, and archives. |
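The split between a registry, source plugins, and manifest generation summarized in the table could be sketched roughly as follows. Only the class name `DataSourcePlugin` comes from the PR; the method names, the `PLUGINS` mapping, the `register` decorator, the `FakeGitPlugin` stand-in, and the `gather` function are all hypothetical and may not match the merged code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Type


class DataSourcePlugin(ABC):
    """Assumed shape of the interface: one plugin per source type."""

    @abstractmethod
    def download(self, name: str, config: dict, dest: Path) -> Path:
        """Fetch the dataset and return the directory it was placed in."""

    @abstractmethod
    def generate_manifest(self, name: str, data_dir: Path) -> Path:
        """Write a manifest for the dataset and return its path."""


# Maps a registry entry's source type to a plugin class (illustrative only).
PLUGINS: Dict[str, Type[DataSourcePlugin]] = {}


def register(source_type: str):
    """Class decorator that adds a plugin to PLUGINS under source_type."""
    def decorator(cls):
        PLUGINS[source_type] = cls
        return cls
    return decorator


@register("git")
class FakeGitPlugin(DataSourcePlugin):
    """Stand-in plugin: creates a directory instead of cloning anything."""

    def download(self, name, config, dest):
        target = dest / name
        target.mkdir(parents=True, exist_ok=True)
        return target

    def generate_manifest(self, name, data_dir):
        manifest = data_dir / f"{name}_manifest.csv"
        manifest.write_text("audio_path\n")  # header only, no real audio
        return manifest


def gather(name: str, entry: dict, dest: Path) -> Path:
    """Look up the plugin for a registry entry, download, then build the manifest."""
    plugin_cls = PLUGINS.get(entry["source"])
    if plugin_cls is None:
        raise ValueError(f"No plugin registered for source {entry['source']!r}")
    plugin = plugin_cls()
    data_dir = plugin.download(name, entry, dest)
    return plugin.generate_manifest(name, data_dir)
```

Under this layout, supporting a new source type means adding one plugin class, and the orchestrator itself needs no changes.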
Pull request overview
Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.
shivangi221b approved these changes on Mar 3, 2026
Introduces a modular data gathering system that replaces fragmented download scripts with a single, plugin-based pipeline.
Summary
- scripts/gather_data.py for all dataset downloads
- scripts/data_gatherer/dataset_registry.yaml
- --sources all (122GB, special setup); use --datasets voxpopuli when needed

Key changes
- Added scripts/data_gatherer/ with plugins, manifest generators, and dataset registry
- Added scripts/augment_audio.py for MUSAN/RIRS augmentation
- Added scripts/data_gather.md as the main data gathering guide
- Removed download_datasets.py and preprocess_data.py
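One way the `--sources` and `--datasets` flags named in the summary could interact is sketched below. This is an assumption-heavy illustration: the `REGISTRY` contents, the `opt_in` key, and the rule that opt-in datasets are skipped by source-based selection are all hypothetical, and the real behavior is defined by scripts/data_gatherer/data_gather.py and the YAML registry.

```python
import argparse

# Hypothetical registry contents; real entries live in dataset_registry.yaml.
# "opt_in" is an invented key marking datasets that need an explicit request.
REGISTRY = {
    "librispeech": {"source": "openslr"},
    "musan": {"source": "openslr"},
    "primock57": {"source": "git"},
    "voxpopuli": {"source": "huggingface", "opt_in": True},
}


def select_datasets(argv):
    """Return the dataset names selected by --sources and --datasets."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sources", nargs="*", default=[])
    parser.add_argument("--datasets", nargs="*", default=[])
    args = parser.parse_args(argv)

    selected = []
    for name, entry in REGISTRY.items():
        requested_by_name = name in args.datasets
        matches_source = "all" in args.sources or entry["source"] in args.sources
        # Assumed rule: opt-in datasets are only picked up when named explicitly.
        if requested_by_name or (matches_source and not entry.get("opt_in", False)):
            selected.append(name)
    return selected
```

With this sketch, `select_datasets(["--sources", "all"])` skips the opt-in entry, while `select_datasets(["--datasets", "voxpopuli"])` selects it on its own.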