Add processors for downloading HiFiTTS2 dataset#115

Merged: karpnv merged 7 commits into main from hifitts2 on Jun 4, 2025
@rlangman rlangman commented May 5, 2025

Collaborator

Adds the processors needed to download HiFiTTS-2. The processors take two input files (e.g. manifest_22khz and chapters_22khz) that the user downloads from another location.

Example command:

python /home/NeMo-speech-data-processor/main.py \
    --config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
    --config-name="config_22khz.yaml" \
    workspace_dir="/home/hifitts2" \
    max_workers=8

This PR also contains a generic processor that estimates the bandwidth of audio, which was used in creating HiFiTTS-2. It is not part of the download pipeline itself, because the bandwidth is already precomputed and provided in the dataset manifest.

Example command:

python /home/NeMo-speech-data-processor/main.py \
    --config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
    --config-name="config_bandwidth.yaml" \
    workspace_dir="/home/hifitts2" \
    audio_dir_name="audio_22khz" \
    input_manifest_file="manifest_22khz.json" \
    max_workers=8
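
For context on what estimating audio bandwidth involves, here is a minimal, dependency-free sketch. It is not the processor added in this PR: the function name `estimate_bandwidth`, the `threshold_db` parameter, and the naive DFT are all assumptions made for illustration, and a practical implementation would likely use an FFT library. The idea shown is to report the highest frequency whose power lies within a chosen number of decibels of the spectral peak.

```python
import cmath
import math


def estimate_bandwidth(samples, sample_rate, threshold_db=-50.0):
    """Return the highest frequency (Hz) whose power is within threshold_db of the peak.

    Illustrative only; names and defaults are assumptions, not the PR's API.
    """
    n = len(samples)
    # Naive DFT over the non-negative frequency bins (O(n^2), fine for short clips).
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        power.append(abs(coeff) ** 2)

    peak = max(power)
    if peak == 0.0:
        return 0.0  # silent input has no measurable bandwidth

    # Convert the dB threshold into a linear power floor relative to the peak.
    floor = peak * 10 ** (threshold_db / 10)

    # Scan from the highest bin downwards for the first bin above the floor.
    for k in range(len(power) - 1, -1, -1):
        if power[k] >= floor:
            return k * sample_rate / n
    return 0.0


# Usage: a short synthetic clip containing 500 Hz and 1000 Hz tones.
rate = 8000
clip = [
    math.sin(2 * math.pi * 500 * t / rate) + 0.5 * math.sin(2 * math.pi * 1000 * t / rate)
    for t in range(256)
]
print(estimate_bandwidth(clip, rate))
```

The threshold controls how much low-level content counts as signal; a stricter (less negative) value yields a lower estimate on noisy audio.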

@rlangman rlangman requested a review from karpnv May 5, 2025 16:47
@rlangman rlangman force-pushed the hifitts2 branch 2 times, most recently from e2b7c6a to 5ba4562 Compare May 5, 2025 18:50
Signed-off-by: Ryan <rlangman@nvidia.com>

@karpnv karpnv left a comment

Collaborator

Great work! Please add 2 end-to-end tests for the 2 new pipelines and mention the new processors in
https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/docs/src/sdp/api.rst

karpnv commented May 7, 2025

Collaborator

Since we usually skip the download processor in end-to-end tests, you can just test the RemovedFailedChapters processor in a unit test.
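
A unit test along these lines might look like the sketch below. The real `RemovedFailedChapters` interface is not shown in this thread, so `remove_failed_chapters`, the `chapter_id` field, and the sample entries are hypothetical stand-ins for the behavior under test: dropping manifest entries that belong to chapters whose download failed.

```python
def remove_failed_chapters(entries, failed_entries, key="chapter_id"):
    """Hypothetical stand-in: drop entries whose chapter appears in the error list."""
    failed = {entry[key] for entry in failed_entries}
    return [entry for entry in entries if entry[key] not in failed]


def test_remove_failed_chapters():
    # Two chapters in the manifest; chapter "b" is listed as failed.
    manifest = [
        {"chapter_id": "a", "audio_filepath": "a/0.flac"},
        {"chapter_id": "b", "audio_filepath": "b/0.flac"},
        {"chapter_id": "a", "audio_filepath": "a/1.flac"},
    ]
    errors = [{"chapter_id": "b"}]

    kept = remove_failed_chapters(manifest, errors)

    # Only utterances from the surviving chapter remain, in their original order.
    assert [e["audio_filepath"] for e in kept] == ["a/0.flac", "a/1.flac"]


test_remove_failed_chapters()
```

Because the check needs no network access or audio files, it avoids the download step entirely.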

Signed-off-by: Ryan <rlangman@nvidia.com>
@rlangman

Collaborator Author

Great work! Please add 2 end-to-end tests for the 2 new pipelines and mention the new processors in https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/docs/src/sdp/api.rst

I tried to add an end-to-end test and more documentation. Let me know if there are any issues, or if I need to upload test data.

I also added a generic bandwidth estimation script, which was used when creating the dataset (but running it during the dataset download is not needed).

Signed-off-by: Ryan <rlangman@nvidia.com>
@Jorjeous

Collaborator

"/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json"
This file is used by the tests, but it seems the tests start before it is created. Let's fix this.

@Jorjeous

Collaborator

Ways to solve this:
1) Change the test config so the file is created first.
2) Read this file from the code itself.
3) Upload it to AWS and download it during the test.
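
Whichever option is chosen, a check like the following can make the failure explicit before the pipeline runs. This is only a sketch: `find_missing_files` is a hypothetical helper, not part of the repository, and the relative directory in the usage lines is an assumption about where the test data would live.

```python
from pathlib import Path


def find_missing_files(data_dir, required_names):
    """Return the required file names that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in required_names if not (root / name).is_file()]


# Usage: report missing inputs up front instead of failing mid-pipeline.
required = ["manifest_22khz.json", "manifest_filtered_22khz.json"]
missing = find_missing_files("test_data/english/hifitts2", required)
if missing:
    print("Missing test data:", ", ".join(missing))
```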

@Jorjeous

Collaborator

Docs OK. The errors are expected, as the pages do not exist yet.

@Jorjeous

Collaborator

Also lacking files:
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/test_data_reference_bandwidth.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/errors_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_22khz.json

@Jorjeous

Collaborator

Could you please rewrite class descriptions in this format? (codeblock with example at the end)

class CreateInitialManifestYTC(BaseParallelProcessor):
    """A processor class for creating initial manifest files for a TTS dataset.

    It takes a manifest file containing audio file paths and resamples them to a target
    sample rate and format, while creating a new manifest file with the updated paths.

    Args:
        input_format (str): Format of the input audio files
        resampled_audio_dir (str): Directory where resampled audio files will be saved
        target_sample_rate (int): Desired sample rate for the output audio files
        target_format (str): Desired format for the output audio files
        target_nchannels (int): Desired number of channels for the output audio files

    Returns:
        The same data as in the input manifest, but with resampled audio files and
        updated audio file paths.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
              input_manifest_file: ${workspace_dir}/manifest.json
              output_manifest_file: ${workspace_dir}/manifest_resampled.json
    """

@Jorjeous Jorjeous left a comment

Collaborator

Once the comments above are addressed, everything is good to go!
Nice!

Signed-off-by: Ryan <rlangman@nvidia.com>
@rlangman

Collaborator Author

Could you please rewrite class descriptions in this format? (codeblock with example at the end)

class CreateInitialManifestYTC(BaseParallelProcessor):
    """A processor class for creating initial manifest files for a TTS dataset.

    It takes a manifest file containing audio file paths and resamples them to a target
    sample rate and format, while creating a new manifest file with the updated paths.

    Args:
        input_format (str): Format of the input audio files
        resampled_audio_dir (str): Directory where resampled audio files will be saved
        target_sample_rate (int): Desired sample rate for the output audio files
        target_format (str): Desired format for the output audio files
        target_nchannels (int): Desired number of channels for the output audio files

    Returns:
        The same data as in the input manifest, but with resampled audio files and
        updated audio file paths.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
              input_manifest_file: ${workspace_dir}/manifest.json
              output_manifest_file: ${workspace_dir}/manifest_resampled.json
    """

Added an Example section to the class documentation.

@rlangman

Collaborator Author

Also lacking files:
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/test_data_reference_bandwidth.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/errors_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_22khz.json

I have set up the data on S3, so the tests pass when I run them locally. I will see whether they pass in the automated PR tests.

@rlangman rlangman requested a review from karpnv June 2, 2025 15:48
Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: Ryan <rlangman@nvidia.com>

@Jorjeous Jorjeous left a comment

Collaborator

LGTM

Comment thread dataset_configs/english/hifitts2/config_44khz.yaml
@karpnv karpnv merged commit 487310d into main Jun 4, 2025
9 of 10 checks passed
3 participants