Closes #95 Adds Thai NNER Dataset by bp-high · Pull Request #326 · SEACrowd/seacrowd-datahub

bp-high · 2024-01-15T11:09:29Z

Closes #95

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

jensan-1

Hi @bp-high
Thanks for implemeting the dataloader.

There are several issues I would like to address:

The dataset is reported to be split to train, dev, test (numbers 2935, 979, 980 in the paper, however it is only train, test (numbers 3914, 980 in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to the SplitGenerator.
Please revise the minor fixes to conform to our dataset format. The suggestions are provided as shown. Also, please be sure to run make check_file command.

Waiting for the update!

seacrowd/sea_datasets/thai_nner/thai_nner.py

jensan-1 · 2024-01-22T09:34:50Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+    label_classes = ['O', 'S-org:religious', 'E-tv_show', 'I-soi', 'E-last', 'E-fund', 'B-country', 'B-port', 'B-airport', 'I-title', 'E-stock_exchange', 'I-restaurant', 'I-station', 'I-org:political', 'I-animal_species', 'I-index', 'E-latitude', 'B-law', 'S-middle', 'I-product:food', 'E-animal_species', 'I-goverment', 'S-city', 'E-longtitude', 'I-song', 'I-award', 'E-event:others', 'B-jargon', 'S-facility:other', 'S-money', 'I-stadium', 'E-space', 'S-animate', 'E-station', 'S-percent', 'E-country', 'I-year', 'S-district', 'E-org:religious', 'E-continent', 'I-food:ingredient', 'I-person', 'B-electronics', 'B-museum', 'B-product:drug', 'B-month', 'B-loc:others', 'B-org:religious', 'S-season', 'I-continent', 'E-norp:political', 'E-airport', 'B-money', 'S-role', 'B-cardinal', 'B-film', 'B-hotel', 'B-unit', 'E-energy', 'I-jargon', 'I-date', 'S-norp:others', 'S-country', 'E-film', 'I-energy', 'B-norp:political', 'E-index', 'I-concert', 'I-sciname', 'I-product:drug', 'I-org:other', 'S-animal_species', 'I-hospital', 'B-nationality', 'S-law', 'S-disease', 'I-latitude', 'E-award', 'B-periodic', 'I-roadname', 'S-jargon', 'E-city', 'S-person', 'S-date', 'E-goverment', 'B-stadium', 'S-religion', 'B-address', 'E-band', 'S-continent', 'I-fold', 'E-stadium', 'S-band', 'E-middle', 'B-org:edu', 'I-distance', 'I-island', 'S-hospital', 'S-loc:others', 'S-book', 'I-war', 'I-norp:political', 'E-title', 'S-war', 'S-org:edu', 'I-nationality', 'E-restaurant', 'S-province', 'E-war', 'I-quantity', 'S-orgcorp', 'E-person', 'E-day', 'E-disease', 'I-last', 'S-product:drug', 'S-space', 'E-product:drug', 'I-country', 'E-weight', 'B-fold', 'E-sciname', 'E-building', 'E-duration', 'B-longtitude', 'E-religion', 'S-soi', 'E-woa', 'B-date', 'I-city', 'I-role', 'E-org:edu', 'I-norp:others', 'B-title', 'I-namemod', 'E-speed', 'S-nicknametitle', 'E-mult', 'E-army', 'I-day', 'S-duration', 'B-time', 'I-rel', 'B-language', 'B-ocean', 'I-woa', 'E-animate', 'I-facility:other', 'E-psudoname', 'E-hotel', 'I-bridge', 'I-port', 'S-address', 'B-war', 'E-mountian', 'B-psudoname', 'I-sports_team', 'B-last', 'B-animate', 'B-latitude', 'E-cardinal', 'S-roadname', 'I-sub_district', 'E-food:ingredient', 'E-nickname', 'E-port', 'B-district', 'B-sports_team', 'E-org:political', 'I-loc:others', 'E-ocean', 'E-month', 'S-bridge', 'B-fund', 'I-money', 'I-state', 'S-unit', 'B-distance', 'B-food:ingredient', 'E-unit', 'S-namemod', 'S-airport', 'I-cardinal', 'S-sub_district', 'S-day', 'B-woa', 'S-army', 'B-tv_show', 'E-bridge', 'E-loc:others', 'S-sciname', 'I-law', 'B-space', 'B-religion', 'I-film', 'E-quantity', 'S-org:other', 'I-sports_event', 'I-event:others', 'S-firstname', 'I-band', 'I-weight', 'E-song', 'I-book', 'E-sports_event', 'S-weapon', 'I-space', 'B-speed', 'B-event:others', 'E-soi', 'I-org:edu', 'E-product:food', 'S-year', 'B-concert', 'B-province', 'B-sciname', 'I-vehicle', 'E-percent', 'I-percent', 'E-org:other', 'I-orgcorp', 'B-org:other', 'E-year', 'I-media', 'S-food:ingredient', 'I-electronics', 'E-hospital', 'I-psudoname', 'I-month', 'E-distance', 'B-duration', 'S-nationality', 'I-firstname', 'B-namemod', 'S-building', 'S-psudoname', 'B-restaurant', 'I-org:religious', 'B-vehicle', 'B-sports_event', 'B-book', 'B-temperature', 'E-norp:others', 'E-time', 'B-org:political', 'E-state', 'I-periodic', 'S-tv_show', 'S-product:food', 'S-postcode', 'S-mult', 'I-disease', 'S-month', 'B-weight', 'B-station', 'E-book', 'E-sub_district', 'E-orgcorp', 'S-fund', 'I-nickname', 'B-band', 'S-stock_exchange', 'S-natural_disaster', 'S-electronics', 'E-sports_team', 'E-museum', 'I-fund', 'B-river', 'B-rel', 'I-unit', 'I-time', 'S-mountian', 'S-port', 'I-temperature', 'I-airport', 'S-god', 'B-year', 'I-army', 'B-index', 'B-natural_disaster', 'E-roadname', 'E-namemod', 'S-game', 'I-hotel', 'B-quantity', 'B-person', 'B-mountian', 'S-river', 'E-vehicle', 'S-media', 'S-sports_team', 'I-river', 'I-mountian', 'B-hospital', 'S-state', 'B-mult', 'S-film', 'B-soi', 'E-role', 'S-org:political', 'S-station', 'E-jargon', 'S-last', 'S-ocean', 'S-quantity', 'S-award', 'E-law', 'I-middle', 'E-firstname', 'E-facility:other', 'I-address', 'I-religion', 'S-time', 'E-periodic', 'S-vehicle', 'B-nickname', 'S-fold', 'B-role', 'E-rel', 'S-goverment', 'I-district', 'B-song', 'E-money', 'E-island', 'I-duration', 'E-electronics', 'B-sub_district', 'E-language', 'E-temperature', 'B-building', 'S-sports_event', 'E-concert', 'B-percent', 'S-island', 'I-museum', 'B-island', 'B-media', 'S-language', 'B-stock_exchange', 'S-title', 'E-natural_disaster', 'B-animal_species', 'B-facility:other', 'B-state', 'E-address', 'I-animate', 'S-stadium', 'I-province', 'B-city', 'B-army', 'I-weapon', 'B-middle', 'S-cardinal', 'E-date', 'S-event:others', 'S-periodic', 'I-speed', 'B-day', 'B-goverment', 'E-media', 'B-norp:others', 'S-nickname', 'E-district', 'I-natural_disaster', 'E-fold', 'I-language', 'S-rel', 'B-product:food', 'B-bridge', 'I-longtitude', 'B-orgcorp', 'B-award', 'E-nationality', 'E-weapon', 'E-province', 'I-tv_show', 'B-energy', 'I-building', 'B-roadname', 'B-continent', 'B-firstname', 'I-stock_exchange', 'S-norp:political', 'S-weight', 'S-restaurant', 'B-weapon', 'S-song', 'B-disease', 'E-river']
+    BUILDER_CONFIGS = [
+        SEACrowdConfig(
+            name="thai_nner_source",


Suggested change

name="thai_nner_source",

name=f"{_DATASETNAME}_source",

seacrowd/sea_datasets/thai_nner/thai_nner.py

jensan-1 · 2024-01-22T09:35:07Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+            version=SOURCE_VERSION,
+            description=" Thai NNER source schema",
+            schema="source",
+            subset_id="thai_nner",


Suggested change

subset_id="thai_nner",

subset_id=f"{_DATASETNAME}",

seacrowd/sea_datasets/thai_nner/thai_nner.py

jensan-1 · 2024-01-22T09:35:33Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+            version=SEACROWD_VERSION,
+            description="Thai NNER SEACrowd schema",
+            schema="seacrowd_seq_label",
+            subset_id="thai_nner",


Suggested change

subset_id="thai_nner",

subset_id=f"{_DATASETNAME}",

seacrowd/sea_datasets/thai_nner/thai_nner.py

jensan-1 · 2024-01-22T09:45:12Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+        else:
+            raise ValueError(f"Invalid config: {self.config.name}")
+
+# This template is based on the following template from the datasets package:


Remove the comment out part here and afterwards.

seacrowd/sea_datasets/thai_nner/thai_nner.py

github-actions · 2024-02-06T01:57:11Z

Hi @ & @, may I know if you are still working on this PR?

jensan-1 · 2024-02-06T02:11:14Z

Hi @ & @, may I know if you are still working on this PR?

I've been waiting for the changes for 2 weeks. Please fix or review immediately.

gentaiscool · 2024-02-13T02:06:52Z

@bp-high would you addressing @jen-santoso's review?

bp-high · 2024-02-13T05:22:17Z

Yes @gentaiscool on it. Will push the recommended changes by eod today.

github-actions · 2024-02-28T01:58:02Z

Hi @ & @, may I know if you are still working on this PR?

jensan-1 · 2024-02-28T02:02:03Z

hello @bp-high

Would you mind addressing the changes I requested? I would like to get this PR soon, and also so that we all can proceed with completing this dataloader's PR.

github-actions · 2024-03-11T14:31:47Z

Hi @gentaiscool, @jen-santoso & @bp-high, may I know if you are still working on this PR?

bp-high · 2024-03-11T14:44:33Z

still working on this will address the changes in a few hours

bp-high · 2024-03-26T08:27:08Z

@jen-santoso Have added the necessary changes as suggested in the review and also run the make check_file command.
As for this comment:

The dataset is reported to be split to train, dev, test (numbers 2935, 979, 980 in the paper, however it is only train, test (numbers 3914, 980 in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to the SplitGenerator

I found that the dataset present in the CONLL format (https://drive.google.com/drive/folders/1FIoSmj-JCRxaDTIbGsiuMhaWERmFUCWq )contains only train and test splits as also mentioned in the paper:

Although I found that there is data present in json format which contains the dev split(https://drive.google.com/drive/folders/1F0fqRdnGMjryp8dAToPc7OEBJlA9Xdi7) when I tried to build it into the CONLL BIOES format tags that was provided by the authors it did not came out identical to the original conll data provided by the authors. I have mailed the authors to ask about this.

jensan-1 · 2024-03-27T07:40:40Z

@bp-high Thank you for the code updates and the investigation of the dataset.
The dataloader runs well and LGTM for now.
For the updates on the splits, I guess for the time being let's leave the dataset split as is and update in another PR (or later) when the authors respond.
WDYT? @SamuelCahyawijaya @holylovenia ?

holylovenia · 2024-04-01T05:15:05Z

@bp-high Thank you for the code updates and the investigation of the dataset. The dataloader runs well and LGTM for now. For the updates on the splits, I guess for the time being let's leave the dataset split as is and update in another PR (or later) when the authors respond. WDYT? @SamuelCahyawijaya @holylovenia ?

Agree. Let's wait for @gentaiscool's review.

holylovenia · 2024-04-08T08:13:13Z

Replacing @gentaiscool with @patrickamadeus for review.

patrickamadeus · 2024-04-20T16:54:42Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+        def _custom_conll_loader(file_path):
+            # Read file
+            data = open(file_path, "r", encoding="utf-8").readlines()
+
+            # Prepare buffer
+            dataset = []
+            sentence, seq_label = [], []
+            for line in data:
+                if len(line.strip()) > 0:
+                    # Handle lines with a single field
+                    if "\t" in line:
+                        token, label = line[:-1].split("\t")
+                    elif "\t" not in line and " " in line:
+                        line_split = line.split()
+                        token = line_split[0]
+                        label = line_split[1:]
+                    else:
+                        token, label = line.strip(), "O"  # Default label if not provided
+
+                    sentence.append(token)
+                    seq_label.append(label)
+                else:
+                    dataset.append({"sentence": sentence, "label": seq_label})
+                    sentence = []
+                    seq_label = []
+            return dataset


Hi @bp-high! Thanks for the contribution!

I only want to address a minor bug that might happen here since the implementation hasn't passed the test case.

THIS IS YOUR `row['sentence']` ['5', '_', 'มี', '.ค.', '_'] THIS IS YOUR `row['label']` [['B-date', 'S-day', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'B-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'E-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

I think there is a slight mistake within the loader function resulting in the captured label becomes list of list instead of list of string, sequentially matched with each token from the sentence. Hence, it prompts TypeError: '<=' not supported between instances of 'int' and 'list' during the check.

Could you please do revision on this? Other than that, LGTM!

Nope this is expected as it is a depth based NER(NNER/ nested NER) with each word having eight depth based ner tags. So the list belongs to each word.

The loader is specially modified to give in the format so that we get all labels(upto depth 8) of each word in a corresponding list marked for it.

patrickamadeus · 2024-04-26T02:55:53Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+        if self.config.schema == "source":
+            features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]})
+
+        elif self.config.schema == "seacrowd_seq_label":
+            features = schemas.seq_label_features(self.label_classes)


Thank you for the clarification @bp-high . Since that's the case, could we revise the feature schema so the implementation satisfies the testcases?

Suggested change

if self.config.schema == "source":

features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]})

elif self.config.schema == "seacrowd_seq_label":

features = schemas.seq_label_features(self.label_classes)

features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [[datasets.Value("string")]]})

# and for SEACrowd seq_label

same goes for the SEACrowd schema

patrickamadeus · 2024-04-26T02:59:09Z

seacrowd/sea_datasets/thai_nner/thai_nner.py

+    label_classes = [
+        "O",
+        "S-org:religious",
+        "E-tv_show",
+        "I-soi",
+        "E-last",
+        "E-fund",
+        "B-country",


While I'm checking, I also found this error ValueError: Invalid string class label S-museum, so far I noticed there are

"S-hotel", "B-postcode", "E-postcode", "B-season", "E-season", ....

and probably more labels that aren't present in label_classes.

Could you please make sure the labels are complete and run python3 -m tests.test_seacrowd seacrowd/sea_datasets/thai_nner/thai_nner.py again?

Thank you!

holylovenia · 2024-05-07T07:58:47Z

A friendly reminder for @bp-high to address @patrickamadeus's suggestions.

holylovenia · 2024-05-13T07:24:58Z

Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia · 2024-05-30T04:37:47Z

Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia · 2024-07-08T06:12:16Z

Hi @bp-high, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @patrickamadeus

Add Thai NNER dataset

74bea6b

bp-high requested review from SamuelCahyawijaya, gentaiscool, holylovenia, jamesjaya, jensan-1, ljvmiranda921, ruochen-z, sabilmakbar and yongzx as code owners January 15, 2024 11:09

Add Thai NNER dataset

24bd742

holylovenia removed request for SamuelCahyawijaya, holylovenia, jamesjaya, ljvmiranda921, ruochen-z, sabilmakbar and yongzx January 22, 2024 09:23

holylovenia assigned gentaiscool and jensan-1 Jan 22, 2024

jensan-1 requested changes Jan 22, 2024

View reviewed changes

github-actions bot added the need-fu-pr label Feb 6, 2024

github-actions bot removed the need-fu-pr label Feb 7, 2024

github-actions bot added the need-fu-pr label Feb 28, 2024

github-actions bot removed the need-fu-pr label Mar 12, 2024

Add changes according to review

3662e53

bp-high requested review from MJonibek, danjohnvelasco and tellarin as code owners March 25, 2024 14:09

Add review changes

7dd6f74

jensan-1 approved these changes Mar 27, 2024

View reviewed changes

holylovenia removed request for MJonibek, danjohnvelasco and tellarin April 1, 2024 05:14

holylovenia requested review from patrickamadeus and removed request for gentaiscool April 8, 2024 08:12

holylovenia assigned patrickamadeus and unassigned gentaiscool Apr 8, 2024

patrickamadeus requested changes Apr 20, 2024

View reviewed changes

bp-high requested a review from patrickamadeus April 25, 2024 22:42

patrickamadeus requested changes Apr 26, 2024

View reviewed changes

github-actions bot added the need-fu-pr label Jun 14, 2024

github-actions bot removed the need-fu-pr label Jul 9, 2024

github-actions bot added the need-fu-pr label Jul 23, 2024

Conversation

bp-high commented Jan 15, 2024

Checkbox

Uh oh!

jensan-1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2024

Uh oh!

jensan-1 commented Feb 6, 2024

Uh oh!

gentaiscool commented Feb 13, 2024

Uh oh!

bp-high commented Feb 13, 2024

Uh oh!

github-actions bot commented Feb 28, 2024

Uh oh!

jensan-1 commented Feb 28, 2024

Uh oh!

github-actions bot commented Mar 11, 2024

Uh oh!

bp-high commented Mar 11, 2024

Uh oh!

bp-high commented Mar 26, 2024

Uh oh!

jensan-1 commented Mar 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

holylovenia commented Apr 1, 2024

Uh oh!

holylovenia commented Apr 8, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

holylovenia commented May 7, 2024

Uh oh!

holylovenia commented May 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

holylovenia commented May 30, 2024

Uh oh!

holylovenia commented Jul 8, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jensan-1 commented Mar 27, 2024 •

edited

Loading

holylovenia commented May 13, 2024 •

edited

Loading