Korean TN fixes: cardinal, decimal, fraction, date #374
base: ko_tn_staging_v1
Conversation
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
for more information, see https://pre-commit.ci
…zation Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
# Grouping separators to remove inside numbers (e.g., "1,234", "1’234", NBSP)
SEP = pynini.union(",", "’", "'", "\u00a0", "\u2009", "\u202f")
# Optional small whitespace inside parentheses or after signs
WS = pynini.closure(pynini.accep(" "), 0, 2)
Are you sure you don't want to use NEMO_SPACE here?
ten_thousands = NEMO_DIGIT**5
graph_ten_thousand_component = (pynini.cross('1', '만') | (graph_digit + pynutil.insert('만'))) + pynini.union(
graph_ten_thousand_component = (
pynini.cross('1', '만') | (graph_digit_no_zero_one + pynutil.insert('만'))
Add another level of parentheses; don't rely on the order of operations between concat and union.
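For context on the precedence question: pynini overloads Python's `+` for concatenation and `|` for union, and Python always binds `+` tighter than `|`. The toy sketch below is not pynini code; `Lang` is a hypothetical stand-in class that models a grammar as a plain set of strings, used only to show how the unparenthesized and explicitly grouped forms can accept different strings.

```python
class Lang:
    """Toy finite language (a set of strings); a hypothetical stand-in for an FST."""

    def __init__(self, *strings):
        self.strings = frozenset(strings)

    def __add__(self, other):
        # Concatenation: every string of self followed by every string of other.
        return Lang(*(a + b for a in self.strings for b in other.strings))

    def __or__(self, other):
        # Union: strings accepted by either operand.
        return Lang(*(self.strings | other.strings))


one = Lang("만")          # stands in for cross('1', '만')
digit = Lang("이", "삼")   # stands in for a digit graph
man = Lang("만")          # stands in for insert('만')
rest = Lang("천")         # stands in for whatever follows

# Python parses this as one | ((digit + man) + rest), because "+" binds tighter than "|".
implicit = one | digit + man + rest

# Explicit grouping: the union is built first, then concatenated with the remainder.
grouped = (one | (digit + man)) + rest

print(sorted(implicit.strings))
print(sorted(grouped.strings))
```

The two expressions differ in whether the bare "만" branch is followed by the remainder, which is the kind of ambiguity extra parentheses make visible to a reader.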
optional_sign = pynini.closure(pynutil.insert('negative: "true" ') + pynini.cross("-", ""), 0, 1)
final_graph = optional_sign + pynutil.insert('integer: "') + graph_num + pynutil.insert('"')
# Delete group separators when they appear between digits (e.g., "1,234" -> "1234")
delete_sep_between_digits = pynini.cdrewrite(
Checking: is there any occurrence of European numbering in Korean text?
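As a rough illustration of the rule under discussion, here is a plain-Python regex analog of deleting a grouping separator only when a digit sits on both sides. It is a sketch, not the PR's `pynini.cdrewrite` grammar; the name `strip_grouping_separators` is hypothetical, and the separator set is copied from the `SEP` definition quoted in the diff.

```python
import re

# Separators from the diff: comma, right single quote, apostrophe,
# no-break space, thin space, narrow no-break space.
SEPARATORS = ",’'\u00a0\u2009\u202f"

# Lookbehind and lookahead restrict the match to separators flanked by digits,
# which mirrors the left/right context of a context-dependent rewrite.
_SEP_BETWEEN_DIGITS = re.compile(rf"(?<=\d)[{re.escape(SEPARATORS)}](?=\d)")


def strip_grouping_separators(text: str) -> str:
    """Remove grouping separators that appear directly between two digits."""
    return _SEP_BETWEEN_DIGITS.sub("", text)


print(strip_grouping_separators("1,234"))
print(strip_grouping_separators("a, b"))
```

A context-only rule like this also merges short lists such as "3,4", so whether the wider separator set is safe depends on how often those forms actually occur in the target text.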
sep_del = pynutil.delete(pynini.closure(sep_token, 1))  # allow mix of - or space

cc16_grouped = four + sep_del + four + sep_del + four + sep_del + four
sep_to_space = pynutil.delete(pynini.closure(sep_token, 0, 1)) + insert_space
Just do pynutil.delete(""").ques; it is the same thing.
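A plain-Python sketch of the 16-digit block grouping that this hunk appears to implement: four blocks of four digits, each pair separated by at most one hyphen or space, re-emitted with single spaces. The regex and the name `group_card_digits` are assumptions for illustration, not the PR's grammar. The optional separator `[- ]?` plays the role that `closure(x, 0, 1)`, or equivalently `.ques`, plays in pynini.

```python
import re

# Four 4-digit blocks; each separator is optional and is a hyphen or a space.
# The word boundaries keep longer digit runs (e.g. 17 digits) from matching.
_CC16 = re.compile(r"\b(\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})\b")


def group_card_digits(text: str) -> str:
    """Rewrite a 16-digit number as four space-separated 4-digit blocks."""
    return _CC16.sub(r"\1 \2 \3 \4", text)


print(group_card_digits("1234-5678 9012-3456"))
print(group_card_digits("1234567890123456"))
```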
add_sep = pynutil.insert(", ")  # standard block separator ", "
# Separator between digit blocks (e.g., "-" or ".")
add_sep = pynutil.delete("-") | pynutil.delete(".")
Rename this. A variable named add_sep that deletes is confusing.
# Separator between digit blocks (e.g., "-" or ".")
add_sep = pynutil.delete("-") | pynutil.delete(".")
# Optional space inserted between blocks
sep_space = insert_space
This isn't a constructive operation; it just aliases insert_space.
token = pynutil.insert("tokens { ") + classify + pynutil.insert(" }")
tagger = pynini.closure(token, 1)
space = pynini.closure(NEMO_WHITE_SPACE, 1)
Why? Just use NEMO_WHITE_SPACE.
What does this PR do?
Add fixes and improvements for Korean TN: cardinal, decimal, ordinal, fraction, date, and post-processing.
Before your PR is "Ready for review"
Pre checks:
[...] git commit -s to sign.
[...] pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
[...] bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
[...] pytest and Sparrowhawk here.
[...] __init__.py for every folder and subfolder, including data folder which has .TSV files?
[...] Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
[...] Copyright 2015 and onwards Google, Inc.. See an example here.
[...] (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.