Closes #221 | Add Dataloader NUS SMS Corpus#596
Closes #221 | Add Dataloader NUS SMS Corpus#596akhdanfadh wants to merge 2 commits intoSEACrowd:masterfrom
Conversation
raileymontalan
left a comment
There was a problem hiding this comment.
Hi @akhdanfadh,
- Please run
make check_fileto fix the small spacing issues. - I am getting the error message
KeyError: '$'when trying to load the dataset. Please advise.
jensan-1
left a comment
There was a problem hiding this comment.
Hello @akhdanfadh .
Tested and LGTM here. Please respond to the comments by @raileymontalan, at least make sure to run the make check_file command to ensure all the space problems are cleared.
|
I've run the
@raileymontalan could you give your test result? |
Hi @akhdanfadh, I am using a MacBook, so the issue could be related to this. Please see the error message here: |
|
@raileymontalan I'm not sure about the macbook issue since I able to test the code in both Ubuntu and MacOS as well (see image below). Since the error is KeyError, I'm guessing it is about the python itself(?), or something in your environment.
|
|
Hi @raileymontalan, can you try running it on Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server. |
|
Hi @raileymontalan, a friendly reminder to review once you have the time. 👍 |
|
Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. cc: @akhdanfadh |
in my end, the data generated has the key of Prob adding additional conditions of creating |
| def xml_element_to_dict(self, element: ET.Element) -> Dict: | ||
| """Converts an xml element to a dictionary.""" | ||
| element_dict = {} | ||
|
|
||
| # add text with key '$', attributes with '@' prefix | ||
| element_dict["$"] = element.text | ||
| for attrib, value in element.attrib.items(): | ||
| element_dict[f"@{attrib}"] = value | ||
|
|
||
| # recursively | ||
| for child in element: | ||
| child_dict = self.xml_element_to_dict(child) | ||
| element_dict[child.tag] = child_dict | ||
|
|
||
| return element_dict |
There was a problem hiding this comment.
something like this:
| def xml_element_to_dict(self, element: ET.Element) -> Dict: | |
| """Converts an xml element to a dictionary.""" | |
| element_dict = {} | |
| # add text with key '$', attributes with '@' prefix | |
| element_dict["$"] = element.text | |
| for attrib, value in element.attrib.items(): | |
| element_dict[f"@{attrib}"] = value | |
| # recursively | |
| for child in element: | |
| child_dict = self.xml_element_to_dict(child) | |
| element_dict[child.tag] = child_dict | |
| return element_dict | |
| def xml_element_to_dict(self, element: ET.Element, root=True) -> Dict: | |
| """Converts an xml element to a dictionary.""" | |
| element_dict = {} | |
| # add text with key '$', attributes with '@' prefix | |
| if element.text: #avoiding appending None text which will alter the schema | |
| element_dict["$"] = element.text | |
| for attrib, value in element.attrib.items(): | |
| element_dict[f"@{attrib}"] = value | |
| # recursively | |
| for child in element: | |
| child_dict = self.xml_element_to_dict(child, root=False) | |
| element_dict[child.tag] = child_dict | |
| return element_dict |
| def xml_element_to_dict(self, element: ET.Element) -> Dict: | |
| """Converts an xml element to a dictionary.""" | |
| element_dict = {} | |
| # add text with key '$', attributes with '@' prefix | |
| element_dict["$"] = element.text | |
| for attrib, value in element.attrib.items(): | |
| element_dict[f"@{attrib}"] = value | |
| # recursively | |
| for child in element: | |
| child_dict = self.xml_element_to_dict(child) | |
| element_dict[child.tag] = child_dict | |
| return element_dict | |
| def xml_element_to_dict(self, element: ET.Element) -> Dict: | |
| """Converts an xml element to a dictionary.""" | |
| element_dict = {} | |
| # add text with key '$', attributes with '@' prefix | |
| element_dict["$"] = element.text | |
| for attrib, value in element.attrib.items(): | |
| element_dict[f"@{attrib}"] = value | |
| # recursively | |
| for child in element: | |
| child_dict = self.xml_element_to_dict(child) | |
| element_dict[child.tag] = child_dict | |
| return element_dict |
There was a problem hiding this comment.
@sabilmakbar How about, as a simple but ugly workaround, we just add $ attribute to each key?
There was a problem hiding this comment.
I actually can't test anything because everything seems to be working on my end, both mac and ubuntu. So I guess I need to pass this to someone.
There was a problem hiding this comment.
hmm prob the issue wasn't about the platform, but to the datasets versions. If I remember it correctly, newer datasets version needs assertions of schema generated from _generate_examples vs defined in _info
|
update: works for |
Still getting the same issues as before when testing on Mac. My |
|
Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! PS: If the issue still persists on MacOS and we cannot find a workaround, should we just wrap it up and add a note that it's only usable for Linux in the |
|
Hi @holylovenia , I would love to continue on remaining tasks. It is not an OS problem AFAIK for now. I'll try testing different package version some time later and see. Cheers! |



Closes #221
I implemented one config per language/subset. Thus, configs will look like this:
nus_sms_corpus_eng_source,nus_sms_corpus_cmn_seacrowd_ssp, etc. When testing, passnus_sms_corpus_<subset>to the--subset_idparameter.Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py(please use only lowercase and underscore for dataset naming)._CITATION,_DATASETNAME,_DESCRIPTION,_HOMEPAGE,_LICENSE,_URLs,_SUPPORTED_TASKS,_SOURCE_VERSION, and_SEACROWD_VERSIONvariables._info(),_split_generators()and_generate_examples()in dataloader script.BUILDER_CONFIGSclass attribute is a list with at least oneSEACrowdConfigfor the source schema and one for a seacrowd schema.datasets.load_datasetfunction.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.