6 changes: 6 additions & 0 deletions .github/workflows/tests.yml
@@ -75,14 +75,20 @@ jobs:
pip install nemo-toolkit[asr,nlp]==1.23.0
pip install nemo_text_processing
pip install -r requirements/huggingface.txt
pip install certifi # needed to avoid certificate problems [CORAAL]
export SSL_CERT_FILE=$(python -m certifi)
python -m pip cache purge


- name: Run all tests
env:
AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
CLEAN_UP_TMP_PATH: 1
run: |
wget https://uit.stanford.edu/sites/default/files/2023/10/11/incommon-rsa-ca2.pem # download the cert manually [for CORAAL]
sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt # [cert for CORAAL]
sudo update-ca-certificates # [cert for CORAAL]
set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
python -m pytest tests/ --junitxml=pytest.xml --ignore=tests/test_tts_sdp_end_to_end.py --cov-report=term-missing:skip-covered --cov=sdp --durations=30 -rs | tee pytest-coverage.txt
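
Note on the certificate steps in this hunk: a variable exported inside one `run` step is not visible to later steps unless it is written to `$GITHUB_ENV`, so the system-wide `update-ca-certificates` call is likely what takes effect for the test run. A Python-side alternative can be sketched as below. This is a minimal sketch, not part of the PR: `build_ssl_context` and `extra_ca_path` are hypothetical names, and it assumes the extra CA is available as a local PEM file.

```python
import os
import ssl
from typing import Optional


def build_ssl_context(extra_ca_path: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying SSL context, optionally trusting one extra CA file.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Start from the platform's default trust store.
    context = ssl.create_default_context()
    # Add the extra CA bundle only if the file is actually present,
    # e.g. a manually downloaded incommon-rsa-ca2.pem.
    if extra_ca_path and os.path.isfile(extra_ca_path):
        context.load_verify_locations(cafile=extra_ca_path)
    return context
```

Such a context could be passed as `urllib.request.urlopen(url, context=build_ssl_context("incommon-rsa-ca2.pem"))`, which keeps the trust change local to the download code instead of the whole runner.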

2 changes: 1 addition & 1 deletion dataset_configs/english/coraal/config.yaml
@@ -18,7 +18,7 @@ documentation: |
This config performs the following data processing.

1. Downloads CORAAL data based on the
- `official file list <http://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_.
+ `official file list <https://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_. #Official mirror link
There are a couple of errors in the links there, which are fixed in our code.
2. Drops all utterances which contain only pauses. Set ``drop_pauses=False`` to undo.
3. Groups all consecutive segments from the same speaker until 20 seconds duration
5 changes: 5 additions & 0 deletions docs/src/conf.py
@@ -189,3 +189,8 @@ def setup(app):
]
# nitpick_ignore_regex = [('py:class', '*')]

# Temporary: skip link checking for the CORAAL download list
linkcheck_ignore = [
r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
]
# https://lingtools.uoregon.edu/coraal/coraal_download_list.txt
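
How the ignore entry behaves can be sketched in plain Python. This is a minimal sketch, assuming Sphinx tests each `linkcheck_ignore` pattern against the start of the URI with `re.match`; `is_ignored` is a hypothetical helper, not a Sphinx API.

```python
import re

# Pattern copied from the linkcheck_ignore entry in this hunk.
linkcheck_ignore = [
    r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
]


def is_ignored(uri: str, patterns=linkcheck_ignore) -> bool:
    """Return True if any pattern matches the beginning of the URI.

    Hypothetical helper; mirrors the assumed re.match semantics.
    """
    return any(re.match(pattern, uri) for pattern in patterns)
```

Under that assumption, only the `https://` form of the download list is skipped; a leftover `http://` link to the same file would still be checked.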
2 changes: 1 addition & 1 deletion requirements/main.txt
@@ -18,7 +18,7 @@ python-docx
pydub
dask
distributed

jiwer>=3.1.0,<4.0.0
# toloka-kit # Temporarily disabled due to Toloka's technical pause; keep as reference for past and future API support
# for some processors, additionally https://github.com/NVIDIA/NeMo is required
# for some processors, additionally nemo_text_processing is required
10 changes: 5 additions & 5 deletions sdp/processors/datasets/coraal/create_initial_manifest.py
@@ -31,15 +31,15 @@ def get_coraal_url_list():
     There are a few mistakes in the official url list that are fixed here.
     Can be overridden by tests to select a subset of urls.
     """
-    dataset_url = "http://lingtools.uoregon.edu/coraal/coraal_download_list.txt"
+    dataset_url = "https://lingtools.uoregon.edu/coraal/coraal_download_list.txt"
     urls = []
     for file_url in urllib.request.urlopen(dataset_url):
         file_url = file_url.decode('utf-8').strip()
         # fixing known errors in the urls
-        if file_url == 'http://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2018.10.06.txt':
-            file_url = 'http://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2021.07.txt'
-        if file_url == 'http://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2018.10.06.txt':
-            file_url = 'http://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2021.07.txt'
+        if file_url == 'https://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2018.10.06.txt':
+            file_url = 'https://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2021.07.txt'
+        if file_url == 'https://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2018.10.06.txt':
+            file_url = 'https://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2021.07.txt'
         urls.append(file_url)
     return urls
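
The URL fixes in this hunk compare against `https://` strings only, so they would not fire if the downloaded list still contained `http://` entries. One way to make the comparison independent of the scheme is sketched below. This is a minimal sketch under stated assumptions, not the PR's implementation: `fix_coraal_urls` and `URL_FIXES` are hypothetical names, and upgrading `http://` entries and skipping blank lines are assumptions that go beyond what the PR does.

```python
from typing import Iterable, List

# Known mistakes in the official list, mapped to the corrected URLs
# (the two pairs are taken from the diff above).
URL_FIXES = {
    'https://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2018.10.06.txt':
        'https://lingtools.uoregon.edu/coraal/les/2021.07/LES_metadata_2021.07.txt',
    'https://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2018.10.06.txt':
        'https://lingtools.uoregon.edu/coraal/vld/2021.07/VLD_metadata_2021.07.txt',
}


def fix_coraal_urls(lines: Iterable[bytes]) -> List[str]:
    """Decode raw lines of the download list and apply the known fixes.

    Hypothetical helper for illustration; not part of the PR.
    """
    urls = []
    for raw in lines:
        url = raw.decode('utf-8').strip()
        if not url:
            continue  # assumption: blank lines carry no URL
        # Assumption: upgrade http:// entries so the fixes match either scheme.
        if url.startswith('http://'):
            url = 'https://' + url[len('http://'):]
        urls.append(URL_FIXES.get(url, url))
    return urls
```

Because the helper takes any iterable of byte lines, it can be fed `urllib.request.urlopen(dataset_url)` directly or a small in-memory list in tests.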
