Add TTS processing pipeline #100
Merged
44 commits
86c5496 Add TTS processing pipeline (fqian1107)
ca7beea Add tts segments processor and update all processors (fqian1107)
416f8eb Remove hallucination detector (fqian1107)
d0ecb1f Update docs to new format (fqian1107)
13fe1f6 Fix id to audio_item_id (fqian1107)
ba92b76 Add manifest example (fqian1107)
30e74b8 Move ndjson install (fqian1107)
845c09a Add missing example (fqian1107)
59ce903 Add necessary key in manifest (fqian1107)
e796f04 Add input manifest file (fqian1107)
376c4c8 Use main Dockerfile (fqian1107)
65908bf fix docs (Jorjeous)
8c0d9c1 add req's installation (Jorjeous)
0c4e643 add reqs installation (Jorjeous)
fabd79b add reqs ans update init file so it will find desired processor (Jorjeous)
0e38dad moving imports into class (Jorjeous)
7f92b0c Empty commit (sushmitha-deva-09)
085dbda Add sdp Dockerfile and github workflow to run tests (sushmitha-deva-09)
80af6f0 Remove tts installation in other test workflow (sushmitha-deva-09)
3405b05 Add separate test for tts sdp and update dockerfile (sushmitha-deva-09)
0fb98ba Update base Dockerfile (sushmitha-deva-09)
9a99465 Use secrets for docker tests (sushmitha-deva-09)
0af5c1f Fix tests (sushmitha-deva-09)
3075ae7 Update tests and remove processor initialization in main (sushmitha-deva-09)
5a8b119 Update requirements (sushmitha-deva-09)
02f34fb Update secret key name (sushmitha-deva-09)
f58e415 Update token name in test script (sushmitha-deva-09)
923d333 Update pyannote script to switch to cpu when gpu not found (sushmitha-deva-09)
aae7923 Add shared memory to docker run (sushmitha-deva-09)
264354d Add docker cleanup after test run (sushmitha-deva-09)
e0f5ec3 Modify default asr model to smaller one (sushmitha-deva-09)
bc11cc8 Enhance logging and fix sdp test (sushmitha-deva-09)
80d68d7 Add README for tts processors (fqian1107)
7f5eace Add all processors (fqian1107)
92273d7 Update README.md and docstrings (sushmitha-deva-09)
dc12b39 retrigger tests (Jorjeous)
f11ea40 fix docs tests by providing full path in rst (Jorjeous)
422a526 update to full path in rst (Jorjeous)
c7c4cb9 update mock of libs (Jorjeous)
02a0d28 Update sdp config (sushmitha-deva-09)
fc735ec Add language info to config in tests (sushmitha-deva-09)
f654a00 update init and mock (Jorjeous)
ae996d3 attempt to fix "optiona ref" (Jorjeous)
7639d15 Fix "optional" (Jorjeous)
@@ -0,0 +1,54 @@
```yaml
name: SDP TTS Docker Build and Test

on:
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:

permissions:
  contents: read

jobs:
  build-and-test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build Docker image
        run: |
          docker build -t sdp-test-image:${{ github.sha }} -f docker/Dockerfile.tts_sdp .

      - name: Run sdp tts tests
        env:
          AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
          AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
          HF_SECRET_KEY: ${{ secrets.HF_SECRET_KEY }}
          CLEAN_UP_TMP_PATH: 1
        run: |
          docker run --rm \
            -v ${{ github.workspace }}:/workspace \
            -w /workspace \
            --shm-size=4g \
            -e AWS_SECRET_KEY="${AWS_SECRET_KEY}" \
            -e AWS_ACCESS_KEY="${AWS_ACCESS_KEY}" \
            -e HF_SECRET_KEY="${HF_SECRET_KEY}" \
            -e CLEAN_UP_TMP_PATH="${CLEAN_UP_TMP_PATH}" \
            sdp-test-image:${{ github.sha }} \
            bash -c "python -m pytest tests/test_tts_sdp_end_to_end.py -v"

      - name: Get test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            pytest.xml
            coverage.xml

      - name: Docker cleanup
        run: docker system prune -af
```
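The workflow above forwards four environment variables into the container. The following is a minimal sketch of how a test module could read them and fail fast when a secret is absent. The names `E2ESettings` and `load_e2e_settings` are hypothetical and do not come from this PR, and treating `CLEAN_UP_TMP_PATH=1` as a boolean flag is an assumption; the actual `tests/test_tts_sdp_end_to_end.py` may handle these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names taken from the workflow's `env:` block.
REQUIRED_SECRETS = ("AWS_SECRET_KEY", "AWS_ACCESS_KEY", "HF_SECRET_KEY")


@dataclass(frozen=True)
class E2ESettings:
    """Hypothetical container for the values the workflow passes in."""
    aws_secret_key: str
    aws_access_key: str
    hf_secret_key: str
    clean_up_tmp_path: bool


def load_e2e_settings(env: Optional[Mapping[str, str]] = None) -> E2ESettings:
    """Read settings from `env` (defaults to os.environ).

    An unset or empty secret is reported as missing, because
    `docker run -e NAME="${NAME}"` passes an empty string when the
    variable has no value on the host.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Assumption: "1", "true" or "yes" enable temp-directory cleanup.
    flag = env.get("CLEAN_UP_TMP_PATH", "0").strip().lower()
    return E2ESettings(
        aws_secret_key=env["AWS_SECRET_KEY"],
        aws_access_key=env["AWS_ACCESS_KEY"],
        hf_secret_key=env["HF_SECRET_KEY"],
        clean_up_tmp_path=flag in {"1", "true", "yes"},
    )
```

GitHub does not expose repository secrets to `pull_request` workflows triggered from forks, so a check of this kind would surface the problem as one clear error instead of a failure deep inside the pipeline.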
@@ -0,0 +1,102 @@
```yaml
documentation: |
  TTS data processing pipeline
  ############################

  This pipeline processes YouTube Commons (YTC) data for text-to-speech (TTS) training.

  The pipeline performs the following steps:
  1. Creates initial manifest by resampling audio to 16kHz mono WAV format
  2. Runs speaker diarization and overlap detection using pyannote
  3. Splits long audio segments
  4. Aligns text and audio using NeMo ASR models
  5. Joins split audio metadata back together
  6. Merges alignment and diarization information
  7. Performs inverse text normalization
  8. Calculates audio quality metrics using TorchSQUIM
  9. Estimates audio bandwidth
  10. Prepares TTS segments

  Required inputs:

  - input_manifest_file: Path to input manifest json file
    - manifest must contain "audio_filepath" and "audio_item_id" fields
    - example: {"audio_filepath": "path/to/raw/audio/file.wav", "audio_item_id": "some_unique_id"}

  - hf_token: HuggingFace token for pyannote access
  - data_split: Data split name (train/dev/test)
  - workspace_dir: Directory for intermediate files
  - language_short: 2-letter language code
  - nemo_path: Path to NeMo installation
  - final_manifest: Path for final output manifest

processors_to_run: all
data_split: ???
workspace_dir: /tmp
language_short: ???
input_manifest_file: ???
final_manifest: ???
nemo_path: ???
resampled_audio_dir: /tmp/audio_resampled
hf_token: ???

max_segment_length: 40
device: cuda

processors:
  - _target_: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
    input_manifest_file: ${input_manifest_file}
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_initial.json"
    resampled_audio_dir: ${resampled_audio_dir}
    input_format: "wav"
    target_sample_rate: 16000
    target_format: "wav"
    target_nchannels: 1

  - _target_: sdp.processors.tts.pyannote.PyAnnoteDiarizationAndOverlapDetection
    hf_token: ${hf_token}
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_initial.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_diarized.json"
    max_length: ${max_segment_length}
    device: ${device}

  - _target_: sdp.processors.tts.split.SplitLongAudio
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_diarized.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_split.json"
    suggested_max_len: ${max_segment_length}
    min_pause_len: 1

  - _target_: sdp.processors.tts.nemo_asr_align.NeMoASRAligner
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_split.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_aligned.json"
    parakeet: True
    ctc: False
    batch_size: 16
    device: ${device}

  - _target_: sdp.processors.tts.split.JoinSplitAudioMetadata
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_aligned.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_joined.json"

  - _target_: sdp.processors.tts.merge_alignment_diarization.MergeAlignmentDiarization
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_joined.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_merged.json"

  - _target_: sdp.processors.tts.text.InverseTextNormalizationProcessor
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_merged.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_ITN.json"
    language: ${language_short}

  - _target_: sdp.processors.tts.metrics.TorchSquimObjectiveQualityMetricsProcessor
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_ITN.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_squim.json"
    device: ${device}

  - _target_: sdp.processors.tts.metrics.BandwidthEstimationProcessor
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_squim.json"
    output_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_bandwidth.json"

  - _target_: sdp.processors.tts.prepare_tts_segments.PrepareTTSSegmentsProcessor
    input_manifest_file: "${workspace_dir}/tts_processed/${data_split}_manifest_bandwidth.json"
    output_manifest_file: "${final_manifest}"
    terminal_punct_marks: ".!?"
```
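Two properties of the config above can be checked before a long GPU run: every input manifest entry carries the `audio_filepath` and `audio_item_id` fields, and each processor reads the manifest that the previous one wrote. The sketch below illustrates both with the standard library only. The helper names `write_input_manifest` and `find_chain_breaks` are hypothetical and not part of SDP, and the one-JSON-object-per-line manifest layout is an assumption based on the single-line example in the documentation string.

```python
import json
from typing import Iterable, List, Mapping, Sequence

# Fields the config's documentation block lists as mandatory.
REQUIRED_FIELDS = ("audio_filepath", "audio_item_id")


def write_input_manifest(entries: Iterable[Mapping[str, str]], path: str) -> int:
    """Validate all entries, then write one JSON object per line.

    Validation happens before the file is opened, so a bad entry
    never leaves a half-written manifest behind.
    """
    rows = list(entries)
    for i, entry in enumerate(rows):
        missing = [key for key in REQUIRED_FIELDS if key not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")
    with open(path, "w", encoding="utf-8") as f:
        for entry in rows:
            f.write(json.dumps(dict(entry), ensure_ascii=False) + "\n")
    return len(rows)


def find_chain_breaks(processors: Sequence[Mapping[str, str]]) -> List[int]:
    """Return indices of processors whose input differs from the
    previous processor's output (compared as raw strings, so this
    also works on unresolved `${...}` templates)."""
    breaks = []
    for i in range(1, len(processors)):
        previous_out = processors[i - 1].get("output_manifest_file")
        current_in = processors[i].get("input_manifest_file")
        if previous_out != current_in:
            breaks.append(i)
    return breaks
```

Applied to the `processors` list in this config, `find_chain_breaks` would be expected to return an empty list, since each `input_manifest_file` repeats the preceding `output_manifest_file` string.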
@@ -0,0 +1,46 @@
```dockerfile
FROM pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel

ARG DEBIAN_FRONTEND=noninteractive

ENV TZ=America/Los_Angeles

# Install basics
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    bzip2 \
    ca-certificates \
    libsox-fmt-mp3 \
    cmake \
    curl \
    ffmpeg \
    g++ \
    sox \
    unzip \
    vim \
    wget

# Update pip
RUN pip install --upgrade pip

# Link all cudnn .so libraries for runtime
RUN ln -s /opt/conda/lib/python3.11/site-packages/nvidia/cudnn/include/cudnn*.h /usr/include/
RUN mkdir -p /usr/local/cuda/lib64
RUN ln -s /opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/libcudnn*.so* /usr/local/cuda/lib64/
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH


# Copy NeMo SDP
WORKDIR /src
COPY . /src/NeMo-speech-data-processor
RUN rm -rf /src/NeMo-speech-data-processor/.git

# Install requirements
WORKDIR /src/NeMo-speech-data-processor
RUN pip install -r requirements/main.txt
RUN pip install -r requirements/tts.txt
RUN pip install flash-attn --no-build-isolation
RUN pip install https://github.com/LahiLuk/YouTokenToMe/archive/master.zip
RUN pip install megatron-core transformer_engine[pytorch]
RUN pip install nemo_toolkit['all']==2.1.0

WORKDIR /src/NeMo-speech-data-processor
```
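The image above installs `ffmpeg` and `sox` as system packages, which the audio steps presumably shell out to. As a rough sketch, a pre-flight check along these lines could confirm that the binaries are on `PATH` inside the container before the pipeline starts. The function name `missing_tools` is hypothetical, and which tools the processors actually invoke is an assumption, not something this diff shows.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the
    binaries being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Binary names provided by the `ffmpeg` and `sox` apt packages.
    absent = missing_tools(["ffmpeg", "sox"])
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("All required tools found.")
```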
@@ -0,0 +1,7 @@
```text
ndjson
transformers
accelerate
torchaudio
pyannote-audio
ffmpeg-python
whisperx==3.3.1
```
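Only `whisperx` is pinned in the list above, so the other packages resolve to whatever is current at build time. The sketch below shows one way to compare an environment against such a list using `importlib.metadata` from the standard library. The helpers `parse_requirements` and `check_installed` are hypothetical names, and the parser only understands bare names and `name==version` lines, which is all this file uses.

```python
from importlib import metadata
from typing import Dict, Iterable, List, Optional, Tuple

Requirement = Tuple[str, Optional[str]]


def parse_requirements(lines: Iterable[str]) -> List[Requirement]:
    """Parse 'name' or 'name==version' lines, skipping blanks and comments.

    Other specifiers (>=, extras, URLs) are deliberately not handled.
    """
    parsed: List[Requirement] = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        parsed.append((name.strip(), version.strip() if sep else None))
    return parsed


def check_installed(requirements: Iterable[Requirement]) -> Dict[str, str]:
    """Return {package: problem} for packages that are absent or off-pin."""
    problems: Dict[str, str] = {}
    for name, pinned in requirements:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if pinned is not None and installed != pinned:
            problems[name] = f"installed {installed}, pinned {pinned}"
    return problems
```

Running `check_installed(parse_requirements(open("requirements/tts.txt")))` inside the built image would list any package that is missing or differs from its pin; the path here is the one the Dockerfile installs from.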
Empty file.