Closes #221 | Add Dataloader NUS SMS Corpus by akhdanfadh · Pull Request #596 · SEACrowd/seacrowd-datahub

akhdanfadh · 2024-04-01T15:17:10Z

Closes #221

I implemented one config per language/subset. Thus, configs will look like this: nus_sms_corpus_eng_source, nus_sms_corpus_cmn_seacrowd_ssp, etc. When testing, pass nus_sms_corpus_<subset> to the --subset_id parameter.
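As a rough illustration of the naming scheme described above, the sketch below enumerates the config names for a list of subsets. The helper `config_names` and the constants are hypothetical, and the subset list only includes the two language codes mentioned in this thread (`eng`, `cmn`); the actual dataloader may define more.

```python
from typing import List

_DATASETNAME = "nus_sms_corpus"
_SUBSETS = ["eng", "cmn"]          # only the subsets named in this PR thread (assumption)
_SEACROWD_SCHEMA = "seacrowd_ssp"  # schema suffix taken from the example config name


def config_names(subsets: List[str]) -> List[str]:
    """Return one source config and one seacrowd config name per subset."""
    names = []
    for subset in subsets:
        names.append(f"{_DATASETNAME}_{subset}_source")
        names.append(f"{_DATASETNAME}_{subset}_{_SEACROWD_SCHEMA}")
    return names


names = config_names(_SUBSETS)
for name in names:
    print(name)
```

The value passed to --subset_id would then be the shared prefix, for example nus_sms_corpus_eng.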

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

raileymontalan

Hi @akhdanfadh,

Please run make check_file to fix the small spacing issues.
I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.

jensan-1

Hello @akhdanfadh,
Tested and LGTM here. Please respond to the comments by @raileymontalan; at the very least, run make check_file to clear all the spacing issues.

akhdanfadh · 2024-04-25T22:51:32Z

I've run make check_file; please double-check.

> I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.

@raileymontalan, could you share your test result?

raileymontalan · 2024-05-06T02:28:25Z

> I've run make check_file; please double-check.
>
> > I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.
>
> @raileymontalan, could you share your test result?

Hi @akhdanfadh, I am using a MacBook, so the issue could be related to that. Please see the error message here:

(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub % python -m tests.test_seacrowd seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py --subset_id="nus_sms_corpus_eng"
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py', schema=None, subset_id='nus_sms_corpus_eng', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
INFO:__main__:self.SUBSET_ID: nus_sms_corpus_eng
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.nus_sms_corpus.nus_sms_corpus
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SELF_SUPERVISED_PRETRAINING: 'SSP'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SSP'}
INFO:__main__:schemas_to_check: {'SSP'}
INFO:__main__:Checking load_dataset with config name nus_sms_corpus_eng_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for nus_sms_corpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 0 examples [00:01, ? examples/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: '$'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
    self.dataset_source = datasets.load_dataset(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

----------------------------------------------------------------------
Ran 1 test in 3.052s

FAILED (errors=1)

akhdanfadh · 2024-05-09T02:28:27Z

@raileymontalan I'm not sure this is a MacBook issue, since I was able to run the test on both Ubuntu and macOS (see image below). Since the error is a KeyError, I'm guessing it comes from the Python version itself(?) or from something else in your environment.

holylovenia · 2024-05-12T13:09:09Z

Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on a Mac, it gave the same error as yours, but I managed to run it without any issues on the server.

holylovenia · 2024-05-21T08:19:20Z

Hi @raileymontalan, a friendly reminder to review once you have the time. 👍

holylovenia · 2024-05-30T04:40:10Z

Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @akhdanfadh

sabilmakbar · 2024-05-31T11:41:48Z

> Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on a Mac, it gave the same error as yours, but I managed to run it without any issues on the server.

Do you have different versions of datasets on the Mac and on the server? That is probably the cause.

On my end, the generated data has a $ key added for each element, which the feature list does not expect.

For now, the best workaround is probably to create the $ key only when element.text is available (not None).
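To make the failure mode concrete, here is a minimal sketch of how a key that appears on only one side (declared features vs. generated example) can surface as KeyError: '$'. The function `zip_by_key` is a simplified stand-in modeled on the `zip_dict` frame visible in the traceback, not the library's actual code, and the feature names are made up for illustration.

```python
from itertools import chain


def zip_by_key(*dicts):
    """Yield (key, values) for every key seen in any dict.

    Raises KeyError as soon as one of the dicts lacks a key that another has.
    """
    for key in dict.fromkeys(chain(*dicts)):  # ordered union of all keys
        yield key, tuple(d[key] for d in dicts)


# Hypothetical declared schema and generated record (assumptions, not from the PR).
declared_features = {"@id": "string", "text": "string"}
generated_example = {"@id": "1", "text": "hello", "$": None}

try:
    list(zip_by_key(declared_features, generated_example))
    mismatch = None
except KeyError as err:
    mismatch = err.args[0]

print(mismatch)
```

The same error would occur in the opposite direction, when the schema declares `$` but a generated record omits it.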

sabilmakbar · 2024-05-31T11:43:38Z

seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py

+    def xml_element_to_dict(self, element: ET.Element) -> Dict:
+        """Converts an xml element to a dictionary."""
+        element_dict = {}
+
+        # add text with key '$', attributes with '@' prefix
+        element_dict["$"] = element.text
+        for attrib, value in element.attrib.items():
+            element_dict[f"@{attrib}"] = value
+
+        # recursively
+        for child in element:
+            child_dict = self.xml_element_to_dict(child)
+            element_dict[child.tag] = child_dict
+
+        return element_dict


Something like this:

Suggested change

-    def xml_element_to_dict(self, element: ET.Element) -> Dict:
+    def xml_element_to_dict(self, element: ET.Element, root=True) -> Dict:
         """Converts an xml element to a dictionary."""
         element_dict = {}

         # add text with key '$', attributes with '@' prefix
-        element_dict["$"] = element.text
+        if element.text:  # avoid appending None text, which would alter the schema
+            element_dict["$"] = element.text
         for attrib, value in element.attrib.items():
             element_dict[f"@{attrib}"] = value

         # recursively
         for child in element:
-            child_dict = self.xml_element_to_dict(child)
+            child_dict = self.xml_element_to_dict(child, root=False)
             element_dict[child.tag] = child_dict

         return element_dict
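To see the effect of the guarded text handling in isolation, here is a minimal standalone sketch. It is written as a free function rather than the builder method, and the XML snippet is made up for illustration; neither comes from the PR.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def xml_element_to_dict(element: ET.Element) -> Dict:
    """Convert an XML element to a dict, skipping the '$' key when there is no text."""
    element_dict = {}

    # text goes under '$' only when present; attributes get an '@' prefix
    if element.text:
        element_dict["$"] = element.text
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value

    # children are converted recursively and stored under their tag name
    for child in element:
        element_dict[child.tag] = xml_element_to_dict(child)

    return element_dict


# Hypothetical sample: <message> and <source/> carry no text of their own.
sample = ET.fromstring('<message id="1"><text>hello</text><source/></message>')
result = xml_element_to_dict(sample)
print(result)
```

Note that with this guard, elements without text produce no `$` key at all, so the features declared in _info would presumably need to agree with that shape for every record.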

akhdanfadh · 2024-05-31

@sabilmakbar How about, as a simple but ugly workaround, we just add the $ attribute to each key?

I actually can't test anything because everything works on my end, on both Mac and Ubuntu, so I guess I need to hand this over to someone.

sabilmakbar · 2024-05-31

Hmm, the issue probably isn't the platform but the datasets version. If I remember correctly, newer datasets versions check that the schema generated by _generate_examples matches the one defined in _info.
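One way to locate such a schema mismatch before loading is to compare key sets recursively. The helper below is a hypothetical debugging sketch: plain nested dicts stand in for the declared features and for one generated example, and nothing here is part of the datasets API.

```python
from typing import Any, List


def find_key_mismatches(declared: Any, example: Any, path: str = "") -> List[str]:
    """List the paths where declared and generated dict keys differ."""
    if not (isinstance(declared, dict) and isinstance(example, dict)):
        return []  # leaf values are not compared in this sketch
    problems = []
    for key in sorted(declared.keys() | example.keys()):
        here = f"{path}/{key}"
        if key not in example:
            problems.append(f"{here}: declared but not generated")
        elif key not in declared:
            problems.append(f"{here}: generated but not declared")
        else:
            problems.extend(find_key_mismatches(declared[key], example[key], here))
    return problems


# Made-up schema and record for illustration only.
declared = {"message": {"@id": "string", "text": {"$": "string"}}}
example = {"message": {"@id": "1", "$": None, "text": {"$": "hi"}}}

report = find_key_mismatches(declared, example)
for line in report:
    print(line)
```

Running a check like this on the first few records of each subset might help explain why one subset loads while another does not.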

sabilmakbar · 2024-05-31T11:56:25Z

Update: it works for the eng subset, but I am still looking for the cause in the cmn subset.

raileymontalan · 2024-06-03T04:58:03Z

> > Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on a Mac, it gave the same error as yours, but I managed to run it without any issues on the server.
>
> Do you have different versions of datasets on the Mac and on the server? That is probably the cause.
>
> On my end, the generated data has a $ key added for each element, which the feature list does not expect.
>
> For now, the best workaround is probably to create the $ key only when element.text is available (not None).

I am still getting the same error as before when testing on Mac. My datasets version is 2.16.1.
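Since the discussion turns on which package versions each machine has, a small sketch like the following could be run on both the Mac and the server to compare environments. The helper name `installed_version` is hypothetical; it only relies on the standard-library `importlib.metadata` and `platform` modules.

```python
import platform
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the details that are being compared across machines in this thread.
print("OS:", platform.system())
print("Python:", platform.python_version())
print("datasets:", installed_version("datasets"))
```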

holylovenia · 2024-07-08T06:09:44Z

Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

PS: If the issue still persists on MacOS and we cannot find a workaround, should we just wrap it up and add a note that it's only usable for Linux in the _DESCRIPTION?

cc: @raileymontalan @sabilmakbar

akhdanfadh · 2024-07-10T08:51:02Z

Hi @holylovenia, I would love to continue with the remaining tasks. As far as I know, it is not an OS problem. I'll try testing different package versions some time later and see. Cheers!

init commit

c820424

akhdanfadh requested review from MJonibek, SamuelCahyawijaya, danjohnvelasco, gentaiscool, holylovenia, jamesjaya, jensan-1, ljvmiranda921, sabilmakbar, tellarin and yongzx as code owners April 1, 2024 15:17

holylovenia requested review from raileymontalan and removed request for MJonibek, SamuelCahyawijaya, danjohnvelasco, gentaiscool, holylovenia, jamesjaya, ljvmiranda921, sabilmakbar, tellarin and yongzx April 17, 2024 07:27

holylovenia assigned raileymontalan and jensan-1 Apr 17, 2024

raileymontalan requested changes Apr 22, 2024

jensan-1 approved these changes Apr 23, 2024

run make check_file

7968101

sabilmakbar reviewed May 31, 2024

github-actions bot added the need-fu-pr label Jun 18, 2024

github-actions bot removed the need-fu-pr label Jul 9, 2024

github-actions bot added the need-fu-pr label Jul 25, 2024
