Conversation
jensan-1
left a comment
There was a problem hiding this comment.
Hi @bp-high
Thanks for implemeting the dataloader.
There are several issues I would like to address:
- The dataset is reported to be split to
train, dev, test(numbers2935, 979, 980in the paper, however it is onlytrain, test(numbers3914, 980in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to theSplitGenerator. - Please revise the minor fixes to conform to our dataset format. The suggestions are provided as shown. Also, please be sure to run
make check_filecommand.
Waiting for the update!
| label_classes = ['O', 'S-org:religious', 'E-tv_show', 'I-soi', 'E-last', 'E-fund', 'B-country', 'B-port', 'B-airport', 'I-title', 'E-stock_exchange', 'I-restaurant', 'I-station', 'I-org:political', 'I-animal_species', 'I-index', 'E-latitude', 'B-law', 'S-middle', 'I-product:food', 'E-animal_species', 'I-goverment', 'S-city', 'E-longtitude', 'I-song', 'I-award', 'E-event:others', 'B-jargon', 'S-facility:other', 'S-money', 'I-stadium', 'E-space', 'S-animate', 'E-station', 'S-percent', 'E-country', 'I-year', 'S-district', 'E-org:religious', 'E-continent', 'I-food:ingredient', 'I-person', 'B-electronics', 'B-museum', 'B-product:drug', 'B-month', 'B-loc:others', 'B-org:religious', 'S-season', 'I-continent', 'E-norp:political', 'E-airport', 'B-money', 'S-role', 'B-cardinal', 'B-film', 'B-hotel', 'B-unit', 'E-energy', 'I-jargon', 'I-date', 'S-norp:others', 'S-country', 'E-film', 'I-energy', 'B-norp:political', 'E-index', 'I-concert', 'I-sciname', 'I-product:drug', 'I-org:other', 'S-animal_species', 'I-hospital', 'B-nationality', 'S-law', 'S-disease', 'I-latitude', 'E-award', 'B-periodic', 'I-roadname', 'S-jargon', 'E-city', 'S-person', 'S-date', 'E-goverment', 'B-stadium', 'S-religion', 'B-address', 'E-band', 'S-continent', 'I-fold', 'E-stadium', 'S-band', 'E-middle', 'B-org:edu', 'I-distance', 'I-island', 'S-hospital', 'S-loc:others', 'S-book', 'I-war', 'I-norp:political', 'E-title', 'S-war', 'S-org:edu', 'I-nationality', 'E-restaurant', 'S-province', 'E-war', 'I-quantity', 'S-orgcorp', 'E-person', 'E-day', 'E-disease', 'I-last', 'S-product:drug', 'S-space', 'E-product:drug', 'I-country', 'E-weight', 'B-fold', 'E-sciname', 'E-building', 'E-duration', 'B-longtitude', 'E-religion', 'S-soi', 'E-woa', 'B-date', 'I-city', 'I-role', 'E-org:edu', 'I-norp:others', 'B-title', 'I-namemod', 'E-speed', 'S-nicknametitle', 'E-mult', 'E-army', 'I-day', 'S-duration', 'B-time', 'I-rel', 'B-language', 'B-ocean', 'I-woa', 'E-animate', 'I-facility:other', 'E-psudoname', 'E-hotel', 'I-bridge', 'I-port', 'S-address', 'B-war', 'E-mountian', 'B-psudoname', 'I-sports_team', 'B-last', 'B-animate', 'B-latitude', 'E-cardinal', 'S-roadname', 'I-sub_district', 'E-food:ingredient', 'E-nickname', 'E-port', 'B-district', 'B-sports_team', 'E-org:political', 'I-loc:others', 'E-ocean', 'E-month', 'S-bridge', 'B-fund', 'I-money', 'I-state', 'S-unit', 'B-distance', 'B-food:ingredient', 'E-unit', 'S-namemod', 'S-airport', 'I-cardinal', 'S-sub_district', 'S-day', 'B-woa', 'S-army', 'B-tv_show', 'E-bridge', 'E-loc:others', 'S-sciname', 'I-law', 'B-space', 'B-religion', 'I-film', 'E-quantity', 'S-org:other', 'I-sports_event', 'I-event:others', 'S-firstname', 'I-band', 'I-weight', 'E-song', 'I-book', 'E-sports_event', 'S-weapon', 'I-space', 'B-speed', 'B-event:others', 'E-soi', 'I-org:edu', 'E-product:food', 'S-year', 'B-concert', 'B-province', 'B-sciname', 'I-vehicle', 'E-percent', 'I-percent', 'E-org:other', 'I-orgcorp', 'B-org:other', 'E-year', 'I-media', 'S-food:ingredient', 'I-electronics', 'E-hospital', 'I-psudoname', 'I-month', 'E-distance', 'B-duration', 'S-nationality', 'I-firstname', 'B-namemod', 'S-building', 'S-psudoname', 'B-restaurant', 'I-org:religious', 'B-vehicle', 'B-sports_event', 'B-book', 'B-temperature', 'E-norp:others', 'E-time', 'B-org:political', 'E-state', 'I-periodic', 'S-tv_show', 'S-product:food', 'S-postcode', 'S-mult', 'I-disease', 'S-month', 'B-weight', 'B-station', 'E-book', 'E-sub_district', 'E-orgcorp', 'S-fund', 'I-nickname', 'B-band', 'S-stock_exchange', 'S-natural_disaster', 'S-electronics', 'E-sports_team', 'E-museum', 'I-fund', 'B-river', 'B-rel', 'I-unit', 'I-time', 'S-mountian', 'S-port', 'I-temperature', 'I-airport', 'S-god', 'B-year', 'I-army', 'B-index', 'B-natural_disaster', 'E-roadname', 'E-namemod', 'S-game', 'I-hotel', 'B-quantity', 'B-person', 'B-mountian', 'S-river', 'E-vehicle', 'S-media', 'S-sports_team', 'I-river', 'I-mountian', 'B-hospital', 'S-state', 'B-mult', 'S-film', 'B-soi', 'E-role', 'S-org:political', 'S-station', 'E-jargon', 'S-last', 'S-ocean', 'S-quantity', 'S-award', 'E-law', 'I-middle', 'E-firstname', 'E-facility:other', 'I-address', 'I-religion', 'S-time', 'E-periodic', 'S-vehicle', 'B-nickname', 'S-fold', 'B-role', 'E-rel', 'S-goverment', 'I-district', 'B-song', 'E-money', 'E-island', 'I-duration', 'E-electronics', 'B-sub_district', 'E-language', 'E-temperature', 'B-building', 'S-sports_event', 'E-concert', 'B-percent', 'S-island', 'I-museum', 'B-island', 'B-media', 'S-language', 'B-stock_exchange', 'S-title', 'E-natural_disaster', 'B-animal_species', 'B-facility:other', 'B-state', 'E-address', 'I-animate', 'S-stadium', 'I-province', 'B-city', 'B-army', 'I-weapon', 'B-middle', 'S-cardinal', 'E-date', 'S-event:others', 'S-periodic', 'I-speed', 'B-day', 'B-goverment', 'E-media', 'B-norp:others', 'S-nickname', 'E-district', 'I-natural_disaster', 'E-fold', 'I-language', 'S-rel', 'B-product:food', 'B-bridge', 'I-longtitude', 'B-orgcorp', 'B-award', 'E-nationality', 'E-weapon', 'E-province', 'I-tv_show', 'B-energy', 'I-building', 'B-roadname', 'B-continent', 'B-firstname', 'I-stock_exchange', 'S-norp:political', 'S-weight', 'S-restaurant', 'B-weapon', 'S-song', 'B-disease', 'E-river'] | ||
| BUILDER_CONFIGS = [ | ||
| SEACrowdConfig( | ||
| name="thai_nner_source", |
There was a problem hiding this comment.
| name="thai_nner_source", | |
| name=f"{_DATASETNAME}_source", |
| version=SOURCE_VERSION, | ||
| description=" Thai NNER source schema", | ||
| schema="source", | ||
| subset_id="thai_nner", |
There was a problem hiding this comment.
| subset_id="thai_nner", | |
| subset_id=f"{_DATASETNAME}", |
| version=SEACROWD_VERSION, | ||
| description="Thai NNER SEACrowd schema", | ||
| schema="seacrowd_seq_label", | ||
| subset_id="thai_nner", |
There was a problem hiding this comment.
| subset_id="thai_nner", | |
| subset_id=f"{_DATASETNAME}", |
| else: | ||
| raise ValueError(f"Invalid config: {self.config.name}") | ||
|
|
||
| # This template is based on the following template from the datasets package: |
There was a problem hiding this comment.
Remove the comment out part here and afterwards.
|
Hi @ & @, may I know if you are still working on this PR? |
I've been waiting for the changes for 2 weeks. Please fix or review immediately. |
|
@bp-high would you addressing @jen-santoso's review? |
|
Yes @gentaiscool on it. Will push the recommended changes by eod today. |
|
Hi @ & @, may I know if you are still working on this PR? |
|
hello @bp-high Would you mind addressing the changes I requested? I would like to get this PR soon, and also so that we all can proceed with completing this dataloader's PR. |
|
Hi @gentaiscool, @jen-santoso & @bp-high, may I know if you are still working on this PR? |
|
still working on this will address the changes in a few hours |
|
@jen-santoso Have added the necessary changes as suggested in the review and also run the make check_file command.
I found that the dataset present in the CONLL format (https://drive.google.com/drive/folders/1FIoSmj-JCRxaDTIbGsiuMhaWERmFUCWq )contains only train and test splits as also mentioned in the paper: Although I found that there is data present in json format which contains the dev split(https://drive.google.com/drive/folders/1F0fqRdnGMjryp8dAToPc7OEBJlA9Xdi7) when I tried to build it into the CONLL BIOES format tags that was provided by the authors it did not came out identical to the original conll data provided by the authors. I have mailed the authors to ask about this. |
|
@bp-high Thank you for the code updates and the investigation of the dataset. |
Agree. Let's wait for @gentaiscool's review. |
|
Replacing @gentaiscool with @patrickamadeus for review. |
| def _custom_conll_loader(file_path): | ||
| # Read file | ||
| data = open(file_path, "r", encoding="utf-8").readlines() | ||
|
|
||
| # Prepare buffer | ||
| dataset = [] | ||
| sentence, seq_label = [], [] | ||
| for line in data: | ||
| if len(line.strip()) > 0: | ||
| # Handle lines with a single field | ||
| if "\t" in line: | ||
| token, label = line[:-1].split("\t") | ||
| elif "\t" not in line and " " in line: | ||
| line_split = line.split() | ||
| token = line_split[0] | ||
| label = line_split[1:] | ||
| else: | ||
| token, label = line.strip(), "O" # Default label if not provided | ||
|
|
||
| sentence.append(token) | ||
| seq_label.append(label) | ||
| else: | ||
| dataset.append({"sentence": sentence, "label": seq_label}) | ||
| sentence = [] | ||
| seq_label = [] | ||
| return dataset |
There was a problem hiding this comment.
Hi @bp-high! Thanks for the contribution!
I only want to address a minor bug that might happen here since the implementation hasn't passed the test case.
THIS IS YOUR `row['sentence']`
['5', '_', 'มี', '.ค.', '_']
THIS IS YOUR `row['label']`
[['B-date', 'S-day', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'B-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'E-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
I think there is a slight mistake within the loader function resulting in the captured label becomes list of list instead of list of string, sequentially matched with each token from the sentence. Hence, it prompts TypeError: '<=' not supported between instances of 'int' and 'list' during the check.
Could you please do revision on this? Other than that, LGTM!
There was a problem hiding this comment.
The loader is specially modified to give in the format so that we get all labels(upto depth 8) of each word in a corresponding list marked for it.
| if self.config.schema == "source": | ||
| features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]}) | ||
|
|
||
| elif self.config.schema == "seacrowd_seq_label": | ||
| features = schemas.seq_label_features(self.label_classes) |
There was a problem hiding this comment.
Thank you for the clarification @bp-high . Since that's the case, could we revise the feature schema so the implementation satisfies the testcases?
| if self.config.schema == "source": | |
| features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]}) | |
| elif self.config.schema == "seacrowd_seq_label": | |
| features = schemas.seq_label_features(self.label_classes) | |
| features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [[datasets.Value("string")]]}) | |
| # and for SEACrowd seq_label |
same goes for the SEACrowd schema
| label_classes = [ | ||
| "O", | ||
| "S-org:religious", | ||
| "E-tv_show", | ||
| "I-soi", | ||
| "E-last", | ||
| "E-fund", | ||
| "B-country", |
There was a problem hiding this comment.
While I'm checking, I also found this error ValueError: Invalid string class label S-museum, so far I noticed there are
"S-hotel",
"B-postcode",
"E-postcode",
"B-season",
"E-season",
....
and probably more labels that aren't present in label_classes.
Could you please make sure the labels are complete and run python3 -m tests.test_seacrowd seacrowd/sea_datasets/thai_nner/thai_nner.py again?
Thank you!
|
A friendly reminder for @bp-high to address @patrickamadeus's suggestions. |
|
Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
|
Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
|
Hi @bp-high, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! cc: @patrickamadeus |


Closes #95
Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py(please use only lowercase and underscore for dataset naming)._CITATION,_DATASETNAME,_DESCRIPTION,_HOMEPAGE,_LICENSE,_URLs,_SUPPORTED_TASKS,_SOURCE_VERSION, and_SEACROWD_VERSIONvariables._info(),_split_generators()and_generate_examples()in dataloader script.BUILDER_CONFIGSclass attribute is a list with at least oneSEACrowdConfigfor the source schema and one for a seacrowd schema.datasets.load_datasetfunction.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.