Add unified data gathering system with plugin architecture #27

Merged: GAInTheHouse merged 28 commits into main from gxa/create-data on Mar 3, 2026

@GAInTheHouse (Owner)

Introduces a modular data gathering system that replaces fragmented download scripts with a single, plugin-based pipeline.

Summary

  • Unified entry point: scripts/gather_data.py for all dataset downloads
  • Plugin architecture: HuggingFace, OpenSLR, and Git plugins with shared utilities
  • YAML registry: 15+ datasets configured in scripts/data_gatherer/dataset_registry.yaml
  • VoxPopuli opt-in: excluded from --sources all because it is 122 GB and needs special setup; pass --datasets voxpopuli to download it explicitly
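
The opt-in behaviour in the last bullet could be sketched roughly as follows. Only the --sources and --datasets flag names come from this description; the registry layout, the opt_in key, the multi-value flag parsing, and the select_datasets helper are assumptions for illustration, not the actual contents of dataset_registry.yaml or the PR's code.

```python
import argparse

# Hypothetical in-memory stand-in for the YAML registry. The real schema is
# not shown in this PR description; "opt_in" is an assumed key.
REGISTRY = {
    "librispeech": {"source": "openslr"},
    "musan": {"source": "openslr"},
    "primock57": {"source": "git"},
    "voxpopuli": {"source": "huggingface", "opt_in": True},
}


def select_datasets(registry, sources=None, datasets=None):
    """Return the dataset names to download, sorted for stable output."""
    if datasets:
        # Explicit names win, so opt-in datasets can be requested directly.
        unknown = sorted(set(datasets) - set(registry))
        if unknown:
            raise KeyError(f"unknown datasets: {unknown}")
        return sorted(set(datasets))

    selected = []
    for name, cfg in registry.items():
        if cfg.get("opt_in"):
            # Opt-in datasets are never picked up by a source-level selection.
            continue
        if not sources or "all" in sources or cfg["source"] in sources:
            selected.append(name)
    return sorted(selected)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Dataset selection sketch")
    parser.add_argument("--sources", nargs="+", default=["all"])
    parser.add_argument("--datasets", nargs="+", default=None)
    return parser.parse_args(argv)
```

In this sketch an explicit --datasets list takes precedence over --sources, which is one simple way to keep a large dataset out of bulk downloads while still making it reachable by name.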

Key changes

  • Added scripts/data_gatherer/ with plugins, manifest generators, and dataset registry
  • Added scripts/augment_audio.py for MUSAN/RIRS augmentation
  • Added scripts/data_gather.md as the main data gathering guide
  • Removed download_datasets.py and preprocess_data.py
  • Updated SETUP.md, environment.yml, and requirements.txt

shivangi221b and others added 26 commits February 11, 2026 17:54
…EFERENCE at root, no extra docs, no agent_evaluator)

Co-authored-by: Cursor <cursoragent@cursor.com>
- ✅ Created Dockerfile.training (GPU) and Dockerfile.training.cpu (CPU)
- ✅ Verified all dependencies: LoRA/PEFT, Wav2Vec2, PyTorch, audio processing
- ✅ Added Docker Compose configuration for training
- ✅ Created verification and build scripts
- ✅ Comprehensive documentation (quickstart, troubleshooting, completion report)
- ✅ CPU container built and verified successfully (3.05GB)
- ✅ GPU container ready for GCP deployment

All Week 1 tasks for Kavya completed. Training environment is reproducible and ready for GCP deployment.

Co-authored-by: Cursor <cursoragent@cursor.com>
…Docker docs

- Move run_comprehensive_evaluations.py to experiments/
- Move verify_evaluation_numbers.py to experiments/
- Move EVALUATION_VERIFICATION_SUMMARY.md to docs/ and expand explanation
- Delete redundant WEEK1_DOCKER_BUILD_SUMMARY.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Delete redundant WEEK1_TRAINING_DOCKER_QUICKSTART.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Add Quick Start section to docs/WEEK1_TRAINING_DOCKER.md
- Expand verification summary purpose and context explanation

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@GAInTheHouse self-assigned this on Mar 2, 2026
Copilot AI review requested due to automatic review settings March 2, 2026 01:14

Copilot AI left a comment

Pull request overview

This PR replaces the legacy dataset download/preprocessing scripts with a unified, registry-driven data gathering pipeline under scripts/data_gatherer/, using a plugin architecture for Hugging Face, OpenSLR, and Git sources and emitting standardized CSV manifests.

Changes:

  • Added a unified CLI orchestrator (scripts/data_gatherer/data_gather.py) plus a convenience wrapper (scripts/gather_data.py) and a YAML dataset registry.
  • Added source plugins (Hugging Face / OpenSLR / Git-LFS) and dataset-specific manifest generators.
  • Added augmentation utility (scripts/augment_audio.py) and updated setup/docs/dependencies; removed legacy download/preprocess scripts.
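
A plugin architecture of the kind summarised above might look something like the sketch below. The class name DataSourcePlugin appears in this PR's file summary, but the method names, their signatures, the PLUGINS mapping, the gather function, and the DummyPlugin class are hypothetical and chosen only to show the dispatch pattern.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class DataSourcePlugin(ABC):
    """Illustrative interface; the method names here are assumptions."""

    @abstractmethod
    def download(self, name: str, config: dict, out_dir: Path) -> Path:
        """Fetch the dataset and return the directory holding its files."""

    @abstractmethod
    def generate_manifest(self, name: str, data_dir: Path) -> Path:
        """Write a CSV manifest for the dataset and return its path."""


class DummyPlugin(DataSourcePlugin):
    """Stand-in plugin that creates an empty dataset without network access."""

    def download(self, name, config, out_dir):
        target = out_dir / name
        target.mkdir(parents=True, exist_ok=True)
        return target

    def generate_manifest(self, name, data_dir):
        manifest = data_dir / f"{name}_manifest.csv"
        manifest.write_text("audio_path,transcript\n", encoding="utf-8")
        return manifest


# Hypothetical mapping from a registry entry's "source" value to a plugin.
PLUGINS = {"dummy": DummyPlugin()}


def gather(name: str, config: dict, out_dir: Path) -> Path:
    """Dispatch one registry entry to its plugin: download, then manifest."""
    plugin = PLUGINS[config["source"]]
    data_dir = plugin.download(name, config, out_dir)
    return plugin.generate_manifest(name, data_dir)
```

With this shape, supporting a new source means adding one subclass and one entry in the mapping, while the orchestrator loop stays unchanged.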

Reviewed changes

Copilot reviewed 24 out of 26 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| scripts/preprocess_data.py | Removed legacy preprocessing pipeline script. |
| scripts/download_datasets.py | Removed legacy dataset download/curation script. |
| scripts/gather_data.py | Added wrapper CLI that forwards to the unified data gatherer. |
| scripts/data_gatherer/data_gather.py | Added main orchestrator CLI for registry-driven downloads + manifest generation. |
| scripts/data_gatherer/dataset_registry.yaml | Added central registry defining datasets and their source configs. |
| scripts/data_gatherer/dataset_utils.py | Added shared utilities for manifests, audio metadata, naming, logging, etc. |
| scripts/data_gatherer/source_plugins/__init__.py | Added DataSourcePlugin interface for plugin architecture. |
| scripts/data_gatherer/source_plugins/huggingface_plugin.py | Added Hugging Face download + manifest generation implementation. |
| scripts/data_gatherer/source_plugins/openslr_plugin.py | Added OpenSLR download/extract + manifest generation routing implementation. |
| scripts/data_gatherer/source_plugins/git_plugin.py | Added Git/Git-LFS clone + manifest generation routing implementation. |
| scripts/data_gatherer/manifest_generators/__init__.py | Added manifest generator package export list. |
| scripts/data_gatherer/manifest_generators/librispeech.py | Added LibriSpeech manifest generator. |
| scripts/data_gatherer/manifest_generators/musan.py | Added MUSAN manifest generator. |
| scripts/data_gatherer/manifest_generators/rirs.py | Added RIRS_NOISES manifest generator. |
| scripts/data_gatherer/manifest_generators/st_aeds.py | Added ST-AEDS manifest generator. |
| scripts/data_gatherer/manifest_generators/primock57.py | Added PriMock57 manifest generator. |
| scripts/data_gatherer/__init__.py | Added package marker + version. |
| scripts/data_gatherer/README.md | Added detailed technical documentation for the new system. |
| scripts/data_gather.md | Added main user-facing data gathering guide. |
| scripts/augment_audio.py | Added standalone audio augmentation utility (MUSAN/RIRS + corruptions). |
| requirements.txt | Updated dependencies for the new pipeline + VoxPopuli decoding. |
| environment.yml | Added conda environment spec including git-lfs/ffmpeg and Python deps. |
| SETUP.md | Added consolidated environment setup guide (conda/docker/manual). |
| README.md | Updated repo docs to reference new pipeline and setup docs. |
| Dockerfile | Added git-lfs installation/init to container image. |
| .gitignore | Updated ignores for new data directories, manifests, and archives. |
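
For a sense of what a dataset-specific manifest generator such as librispeech.py might do, here is a minimal sketch. It relies on the standard LibriSpeech layout, where each chapter directory holds a *.trans.txt file with one "utterance-id TRANSCRIPT" line per .flac file. The column names and the helper names librispeech_rows and write_manifest are assumptions; the PR's actual manifest schema is not shown on this page.

```python
import csv
from pathlib import Path

# Assumed manifest columns; the real standardized schema may differ.
MANIFEST_COLUMNS = ["audio_path", "transcript", "speaker_id"]


def librispeech_rows(root: Path):
    """Yield one manifest row per utterance found under a LibriSpeech tree."""
    for trans_file in sorted(root.rglob("*.trans.txt")):
        for line in trans_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each line is "<utterance-id> <TRANSCRIPT>".
            utt_id, _, text = line.partition(" ")
            audio = trans_file.parent / f"{utt_id}.flac"
            if not audio.exists():
                # Skip transcripts whose audio file is missing.
                continue
            yield {
                "audio_path": str(audio),
                "transcript": text.strip(),
                # Utterance ids have the form speaker-chapter-index.
                "speaker_id": utt_id.split("-")[0],
            }


def write_manifest(rows, manifest_path: Path) -> int:
    """Write rows to a CSV manifest and return how many were written."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with manifest_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=MANIFEST_COLUMNS)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Keeping the row generator separate from the CSV writer would let every dataset share one writer while only the directory-walking logic differs per dataset.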

Comment thread scripts/data_gatherer/source_plugins/openslr_plugin.py
Comment thread scripts/data_gatherer/source_plugins/huggingface_plugin.py
Comment thread scripts/data_gatherer/README.md Outdated
Comment thread scripts/data_gather.md Outdated
Comment thread scripts/data_gatherer/source_plugins/openslr_plugin.py
Comment thread scripts/data_gatherer/source_plugins/openslr_plugin.py
Comment thread scripts/data_gatherer/manifest_generators/__init__.py Outdated
Comment thread scripts/data_gatherer/data_gather.py Outdated
Comment thread scripts/data_gatherer/data_gather.py Outdated
Comment thread scripts/data_gatherer/dataset_utils.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.

Comment thread scripts/data_gatherer/dataset_registry.yaml
Comment thread scripts/data_gatherer/source_plugins/huggingface_plugin.py
Comment thread scripts/data_gatherer/source_plugins/openslr_plugin.py Outdated
Comment thread scripts/data_gatherer/source_plugins/openslr_plugin.py
Comment thread scripts/data_gatherer/README.md Outdated
Comment thread scripts/data_gatherer/dataset_utils.py
Comment thread scripts/data_gather.md
Comment thread scripts/data_gatherer/README.md Outdated
@GAInTheHouse merged commit 9641853 into main on Mar 3, 2026
3 checks passed
@GAInTheHouse deleted the gxa/create-data branch on March 6, 2026 02:28

4 participants