Add processors for downloading HiFiTTS2 dataset by rlangman · Pull Request #115 · NVIDIA/NeMo-speech-data-processor

rlangman · 2025-05-05T16:47:17Z

Adds the processors needed to download HiFiTTS-2. The processors take two input files (e.g. manifest_22khz and chapters_22khz), which the user downloads separately from another location.

Example command:

python /home/NeMo-speech-data-processor/main.py \
    --config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
    --config-name="config_22khz.yaml" \
    workspace_dir="/home/hifitts2" \
    max_workers=8

This PR also contains a generic processor that estimates the bandwidth of audio, which was used in creating HiFiTTS-2. It is not part of the download pipeline itself, because the bandwidth is already precomputed and provided in the dataset manifest.

Example command:

python /home/NeMo-speech-data-processor/main.py \
    --config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
    --config-name="config_bandwidth.yaml" \
    workspace_dir="/home/hifitts2" \
    audio_dir_name="audio_22khz" \
    input_manifest_file="manifest_22khz.json" \
    max_workers=8
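
For readers unfamiliar with bandwidth estimation, the idea can be sketched in a few lines. This is not the processor added in this PR; it is a minimal illustration that assumes one simple heuristic: compute the magnitude spectrum and report the highest frequency whose level is within a fixed threshold of the spectral peak. The function name `estimate_bandwidth`, the `threshold_db` parameter, and the naive DFT are all assumptions made for the example.

```python
import cmath
import math


def estimate_bandwidth(samples, sample_rate, threshold_db=-50.0):
    """Return the highest frequency (Hz) whose level is within threshold_db of the peak.

    Hypothetical helper for illustration only; not the processor from this PR.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    # Naive O(n^2) DFT over the non-negative frequency bins (0 .. n // 2).
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        magnitudes.append(abs(acc))
    peak = max(magnitudes)
    if peak == 0.0:
        return 0.0  # silent input has no measurable bandwidth
    # Convert to dB relative to the peak; the floor guards log10 against zero.
    floor = 1e-12
    levels_db = [20.0 * math.log10(max(m / peak, floor)) for m in magnitudes]
    # Scan from the top bin downwards for the first bin above the threshold.
    for k in range(len(levels_db) - 1, -1, -1):
        if levels_db[k] >= threshold_db:
            return k * sample_rate / n
    return 0.0


# Toy signal: a 1 kHz tone plus a quieter 3 kHz tone, sampled at 16 kHz.
sample_rate = 16000
n = 256
signal = [
    math.sin(2 * math.pi * 1000 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 3000 * i / sample_rate)
    for i in range(n)
]
bandwidth_hz = estimate_bandwidth(signal, sample_rate)
```

For this toy signal both tones fall exactly on DFT bins, so the estimate is the frequency of the higher tone, 3000 Hz. Real recordings would need windowing and averaging over frames, which the sketch leaves out.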

Signed-off-by: Ryan <rlangman@nvidia.com>

karpnv

Great work! Please add two end-to-end tests for the two new pipelines and mention the new processors in
https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/docs/src/sdp/api.rst

karpnv · 2025-05-07T10:56:29Z

Since we usually skip the download processor in the end-to-end test, you can just test the RemovedFailedChapters processor in a unit test.
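
A unit test of this kind could look roughly like the sketch below. It does not call the real RemovedFailedChapters processor, whose constructor arguments are not shown in this thread. Instead it uses a hypothetical stand-in function, `remove_failed_chapters`, and an assumed `chapter_id` field to illustrate the shape of the test: write a small manifest, mark one chapter as failed, and assert that its utterances are dropped.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON-lines manifest into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, entries):
    """Write a list of dicts as a JSON-lines manifest."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def remove_failed_chapters(manifest_path, error_path, output_path):
    """Hypothetical stand-in: drop entries whose chapter appears in the error file."""
    failed = {entry["chapter_id"] for entry in read_jsonl(error_path)}
    kept = [e for e in read_jsonl(manifest_path) if e["chapter_id"] not in failed]
    write_jsonl(output_path, kept)
    return kept


def test_remove_failed_chapters():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Field names and values here are assumptions, not the dataset schema.
        manifest = [
            {"audio_filepath": "a.flac", "chapter_id": "1"},
            {"audio_filepath": "b.flac", "chapter_id": "2"},
            {"audio_filepath": "c.flac", "chapter_id": "2"},
        ]
        write_jsonl(tmp / "manifest.json", manifest)
        write_jsonl(tmp / "errors.json", [{"chapter_id": "2"}])

        remove_failed_chapters(
            tmp / "manifest.json", tmp / "errors.json", tmp / "filtered.json"
        )

        filtered = read_jsonl(tmp / "filtered.json")
        assert filtered == [{"audio_filepath": "a.flac", "chapter_id": "1"}]


test_remove_failed_chapters()
```

A test against the real processor would replace the stand-in with the processor class and keep the same structure of inputs and assertion.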


rlangman · 2025-05-12T22:48:14Z

> Great work! Please add 2 end-to-end tests for 2 new pipelines and mention new processors in https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/docs/src/sdp/api.rst

I tried to add an end-to-end test and more documentation. Let me know if there are any issues or if I need to upload test data.

I also added a generic bandwidth estimation script, which was used when creating the dataset (but running it during the dataset download is not needed).


Jorjeous · 2025-05-14T11:45:36Z

"/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json"
This file is used by the tests, but it seems the tests start before it is created. Let's fix this.

Jorjeous · 2025-05-14T11:53:03Z

Ways to solve this:
1) Change the test config so that the file is created first.
2) Read this file from the code itself.
3) Upload it to AWS and download it during the test.
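
Whichever option is chosen, the underlying fix is the same: make sure the reference files exist before the test reads them. Below is a minimal sketch of such a guard. The `ensure_test_files` helper and its `fetch` callback are hypothetical; `fetch` stands in for an S3 download or in-code generation, and neither is the repository's actual test helper.

```python
import tempfile
from pathlib import Path


def ensure_test_files(data_dir, filenames, fetch):
    """Create any missing reference file via fetch(name, destination).

    Returns the names that had to be fetched. Hypothetical helper for illustration.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in filenames:
        destination = data_dir / name
        if not destination.exists():
            fetch(name, destination)
            # Fail early with a clear message instead of deep inside the test.
            if not destination.exists():
                raise FileNotFoundError(f"fetch did not create {destination}")
            created.append(name)
    return created


def fake_fetch(name, destination):
    # Stand-in for a real download; writes an empty manifest file.
    destination.write_text("", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # The first call fetches the file; the second finds it already present.
    first = ensure_test_files(tmp, ["manifest_filtered_22khz.json"], fake_fetch)
    second = ensure_test_files(tmp, ["manifest_filtered_22khz.json"], fake_fetch)
```

Calling a guard like this from a test fixture would keep the test itself independent of where the data comes from.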

Jorjeous · 2025-05-26T07:20:08Z

The docs are OK; the errors are expected because the pages do not exist yet.

Jorjeous · 2025-05-26T07:58:45Z

The following files are also missing:
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/test_data_reference_bandwidth.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/errors_22khz.json
/home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_22khz.json

Jorjeous · 2025-05-26T13:13:05Z

Could you please rewrite class descriptions in this format? (codeblock with example at the end)

class CreateInitialManifestYTC(BaseParallelProcessor):
    """A processor class for creating initial manifest files for a TTS dataset.

    It takes a manifest file containing audio file paths and resamples them to a target
    sample rate and format, while creating a new manifest file with the updated paths.

    Args:
        input_format (str): Format of the input audio files
        resampled_audio_dir (str): Directory where resampled audio files will be saved
        target_sample_rate (int): Desired sample rate for the output audio files
        target_format (str): Desired format for the output audio files
        target_nchannels (int): Desired number of channels for the output audio files

    Returns:
        The same data as in the input manifest, but with resampled audio files and
        updated audio file paths.

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
              input_manifest_file: ${workspace_dir}/manifest.json
              output_manifest_file: ${workspace_dir}/manifest_resampled.json
    """

Jorjeous

Once the comments above are addressed, everything is good to go!
Nice!


rlangman · 2025-05-27T19:56:09Z

> Could you please rewrite class descriptions in this format? (codeblock with example at the end)

Added Example section in documentation.

rlangman · 2025-05-27T19:58:11Z

> Also lacking files:
> /home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/test_data_reference_bandwidth.json
> /home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_filtered_22khz.json
> /home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/errors_22khz.json
> /home/runner/work/NeMo-speech-data-processor/NeMo-speech-data-processor/test_data/english/hifitts2/manifest_22khz.json

I have set up the test data on S3, and the tests pass when I run them locally. I will see whether they also pass in the automated PR checks.


Jorjeous

LGTM

rlangman requested a review from karpnv May 5, 2025 16:47

rlangman force-pushed the hifitts2 branch 2 times, most recently from e2b7c6a to 5ba4562 Compare May 5, 2025 18:50

Add processors for downloading HiFiTTS2 dataset

2ef44f8

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman force-pushed the hifitts2 branch from 5ba4562 to 2ef44f8 Compare May 5, 2025 19:50

karpnv requested changes May 7, 2025


Add bandwidth estimation processor from HiFiTTS-2

2f63993

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman force-pushed the hifitts2 branch from 3add85e to 2f63993 Compare May 12, 2025 22:16

Merge branch 'main' into hifitts2

5684a5d

rlangman force-pushed the hifitts2 branch from 94e8f93 to 5684a5d Compare May 12, 2025 22:38

Fix bandwidth documentation

aa643f8

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman force-pushed the hifitts2 branch from 30aa258 to aa643f8 Compare May 12, 2025 23:50

karpnv requested review from Jorjeous and lilithgrigoryan May 13, 2025 21:00

Jorjeous reviewed May 26, 2025


Fix tests, add example to docstrings

64fb25e

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman requested a review from karpnv June 2, 2025 15:48

Fix exception handling for URLError

2fd724e

Signed-off-by: Ryan <rlangman@nvidia.com>

rlangman force-pushed the hifitts2 branch from aafb831 to 2fd724e Compare June 2, 2025 18:28

Add 44kHz config test

ed4124f

Signed-off-by: Ryan <rlangman@nvidia.com>

Jorjeous approved these changes Jun 4, 2025


Comment thread dataset_configs/english/hifitts2/config_44khz.yaml

karpnv merged commit 487310d into main Jun 4, 2025
9 of 10 checks passed
