bbae0312 commented Dec 16, 2025

What does this PR do?

Adds fixes and improvements for Korean text normalization (TN): cardinal, decimal, ordinal, fraction, date, and post-processing.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unit tests finish successfully before sending the PR?
    1. Run pytest, or pytest --cpu if your machine does not have a GPU, from the root folder (provided you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: have you added test cases for both pytest and Sparrowhawk?
  • Have you added __init__.py for every folder and subfolder, including the data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py, your header's second line should be Copyright 2015 and onwards Google, Inc.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature, please update the NeMo documentation (it lives in a different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items, you can still open a "Draft" PR.

bbae0312 and others added 4 commits December 15, 2025 16:00
Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>
…zation

Signed-off-by: Jinwoo Bae <34386414+bbae0312@users.noreply.github.com>

github-actions bot commented Jan 1, 2026

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 1, 2026
@mgrafu mgrafu removed the Stale label Jan 5, 2026
github-actions bot commented

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 20, 2026
github-actions bot commented

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Jan 27, 2026
@tbartley94 tbartley94 reopened this Jan 27, 2026
# Grouping separators to remove inside numbers (e.g., "1,234", "1’234", NBSP)
SEP = pynini.union(",", "’", "'", "\u00a0", "\u2009", "\u202f")
# Optional small whitespace inside parentheses or after signs
WS = pynini.closure(pynini.accep(" "), 0, 2)
Member commented:

Are you sure you don't want to use NEMO_SPACE here?

ten_thousands = NEMO_DIGIT**5
graph_ten_thousand_component = (pynini.cross('1', '만') | (graph_digit + pynutil.insert('만'))) + pynini.union(
graph_ten_thousand_component = (
pynini.cross('1', '만') | (graph_digit_no_zero_one + pynutil.insert('만'))
Member commented:

Add another level of parentheses. Don't rely on the order of operations between concatenation and union.
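
As a minimal sketch of why the extra parentheses matter: pynini expresses concatenation with Python's `+` and union with `|`, and Python binds `+` more tightly than `|`. The `Expr` class below is a hypothetical stand-in, not a pynini type; it only records how Python grouped the operands.

```python
class Expr:
    """Hypothetical stand-in for an FST that records operator grouping."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Plays the role of concatenation.
        return Expr(f"({self.text} + {other.text})")

    def __or__(self, other):
        # Plays the role of union.
        return Expr(f"({self.text} | {other.text})")


a, b, c = Expr("a"), Expr("b"), Expr("c")

# Without parentheses, "+" is applied before "|".
implicit = (a | b + c).text

# With explicit parentheses, the union is built first.
explicit = ((a | b) + c).text

print(implicit)
print(explicit)
```

The two expressions group differently, so writing the parentheses out removes any doubt about which grammar is being built.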

optional_sign = pynini.closure(pynutil.insert('negative: "true" ') + pynini.cross("-", ""), 0, 1)
final_graph = optional_sign + pynutil.insert('integer: "') + graph_num + pynutil.insert('"')
# Delete group separators when they appear between digits (e.g., "1,234" -> "1234")
delete_sep_between_digits = pynini.cdrewrite(
Member commented:

Checking: is there any occurrence of European numbering in Korean text?
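
To make the question concrete, here is a plain-Python sketch of the "delete separators between digits" idea. It uses the standard `re` module rather than the PR's pynini `cdrewrite` rule, and the function name `strip_group_separators` is hypothetical. The separator set is copied from the `SEP` definition in the diff.

```python
import re

# Same characters as SEP in the diff: comma, right single quote, apostrophe,
# no-break space, thin space, narrow no-break space.
_GROUP_SEP = ",’'\u00a0\u2009\u202f"

# A separator counts only when it sits directly between two digits.
_SEP_BETWEEN_DIGITS = re.compile(rf"(?<=\d)[{_GROUP_SEP}](?=\d)")


def strip_group_separators(text: str) -> str:
    """Remove grouping separators that appear between two digits."""
    return _SEP_BETWEEN_DIGITS.sub("", text)


print(strip_group_separators("1,234,567"))
print(strip_group_separators("1.234,56"))
```

In this sketch a European-style decimal such as "1.234,56" also loses its comma, which is the kind of ambiguity the question is probing.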

sep_del = pynutil.delete(pynini.closure(sep_token, 1)) # allow mix of - or space

cc16_grouped = four + sep_del + four + sep_del + four + sep_del + four
sep_to_space = pynutil.delete(pynini.closure(sep_token, 0, 1)) + insert_space
Member commented:

Just do pynutil.delete(""").ques. It is the same thing.
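
For context, `.ques` on a pynini FST is shorthand for `closure(..., 0, 1)`, that is, zero or one occurrence. As a rough regex analogue (not the PR's grammar), an optional separator between four-digit blocks that is dropped and replaced by a single space could be sketched as follows. The function name and the choice of "-" or space as separators are assumptions based on the inline comment in the diff.

```python
import re
from typing import Optional

# Four blocks of four digits, each pair split by at most one "-" or space.
# "[- ]?" is the regex counterpart of an optional (zero-or-one) separator.
_CARD_16 = re.compile(r"(\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})")


def group_card_digits(text: str) -> Optional[str]:
    """Return the four digit blocks joined by single spaces, or None."""
    match = _CARD_16.fullmatch(text)
    if match is None:
        return None
    return " ".join(match.groups())


print(group_card_digits("1234-5678-9012-3456"))
print(group_card_digits("1234567890123456"))
```

Because the separator is optional rather than repeated, an input with a doubled separator such as "1234--5678-9012-3456" is rejected in this sketch.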


add_sep = pynutil.insert(", ") # standard block separator ", "
# Separator between digit blocks (e.g., "-" or ".")
add_sep = pynutil.delete("-") | pynutil.delete(".")
Member commented:

Rename this. A variable called add_sep that deletes separators is confusing.

# Separator between digit blocks (e.g., "-" or ".")
add_sep = pynutil.delete("-") | pynutil.delete(".")
# Optional space inserted between blocks
sep_space = insert_space
Member commented:

This isn't a constructive operation; sep_space only aliases insert_space.


token = pynutil.insert("tokens { ") + classify + pynutil.insert(" }")
tagger = pynini.closure(token, 1)
space = pynini.closure(NEMO_WHITE_SPACE, 1)
Member commented:

Why is this needed? Just use NEMO_WHITE_SPACE.

@github-actions github-actions bot removed the Stale label Jan 28, 2026

3 participants