Skip to content

Refactor and modernize StringNormalizer.#28320

Open
yuslepukhin wants to merge 7 commits intomainfrom
yuslepukhin/stringnorm_conv
Open

Refactor and modernize StringNormalizer.#28320
yuslepukhin wants to merge 7 commits intomainfrom
yuslepukhin/stringnorm_conv

Conversation

@yuslepukhin
Copy link
Copy Markdown
Member

This pull request refactors and modernizes the UTF-8 and wide character (wchar_t) string conversion logic in the string normalizer CPU kernel, replacing deprecated and complex code with new, platform-appropriate utilities. The changes improve code maintainability, portability, and performance, especially on non-Windows platforms, by introducing custom UTF-8 conversion routines and simplifying buffer management.

The most important changes are:

UTF-8 and Wide Character Conversion Utilities:

  • Added new UTF-8 <-> wchar_t conversion functions (WideToUtf8RequiredSize, WideToUtf8, Utf8ToWide, and Utf8ToWideString) for non-Windows platforms in utf8_util.h, avoiding deprecated std::codecvt and providing robust Unicode handling.
  • Updated Utf8ConverterGeneric in string_normalizer.cc to use these new utilities, greatly simplifying the code and removing legacy/deprecated conversion logic.

Code Simplification and Performance:

  • Simplified buffer size estimation for conversions: now directly uses the UTF-8 string size as an upper bound for the wide buffer, removing the need for a full decode pass just to compute buffer sizes.
  • Improved comments and logic for case-insensitive filtering, clarifying why lowercasing is used and how conversions are managed for efficiency. [1] [2]

Cleanup and Modernization:

  • Removed all usage of deprecated std::codecvt and related workaround code, as well as unnecessary includes and platform-specific handling, resulting in cleaner and more maintainable code. [1] [2] [3]

These changes collectively modernize the string normalization kernel, improve portability, and make the codebase easier to maintain.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR modernizes the CPU StringNormalizer kernel’s UTF-8 / wchar_t conversion path by replacing deprecated/complex conversion logic with shared UTF-8 utilities, simplifying buffer sizing/management, and adding targeted tests to improve coverage.

Changes:

  • Add non-Windows UTF-8 ↔ wchar_t conversion helpers to core/common/utf8_util.h.
  • Refactor string_normalizer.cc to use the shared utilities and simplify wide-buffer sizing/allocation logic.
  • Expand unit tests for both UTF-8 utilities and StringNormalizer edge cases (multilingual, filtering, shapes, invalid inputs).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
onnxruntime/core/common/utf8_util.h Adds new UTF-8↔wide conversion helpers and tweaks utf8_bytes bit tests.
onnxruntime/core/providers/cpu/text/string_normalizer.cc Switches generic conversion to utf8_util helpers and gates wide-buffer sizing/allocation.
onnxruntime/core/providers/cpu/text/string_normalizer.h Updates rationale comments for case-insensitive comparison strategy.
onnxruntime/test/common/utf8_util_test.cc Adds broad unit test coverage for UTF-8 helpers and conversions.
onnxruntime/test/providers/cpu/text/string_normalizer_test.cc Adds many coverage-focused tests for filtering, casing, shapes, and locale/error handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/test/providers/cpu/text/string_normalizer_test.cc
Comment thread onnxruntime/test/providers/cpu/text/string_normalizer_test.cc
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.h Outdated
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc
Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/test/common/utf8_util_test.cc
Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/core/common/utf8_util.h
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc Outdated
Comment thread onnxruntime/test/providers/cpu/text/string_normalizer_test.cc
Comment thread onnxruntime/core/common/utf8_util.h
@yuslepukhin yuslepukhin requested a review from Copilot May 4, 2026 18:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/test/common/utf8_util_test.cc
Comment thread onnxruntime/test/common/utf8_util_test.cc
Comment thread onnxruntime/core/common/utf8_util.h
Comment thread onnxruntime/core/common/utf8_util.h Outdated
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc Outdated
Comment thread onnxruntime/test/providers/cpu/text/string_normalizer_test.cc Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc
Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc
"Non-Windows UTF-8/wchar_t conversion helpers require wchar_t to be at least 32 bits.");

/// Compute the number of UTF-8 bytes required to encode a wide string.
inline size_t WideToUtf8RequiredSize(const std::wstring& wstr) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use standard windows APIs: WideCharToMultiByte and MultiByteToWideChar etc?

Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one remaining issue in the new UTF-8 conversion helper. The earlier shape-test concern is already covered in an existing thread, so I’m not duplicating it here.

*dest++ = static_cast<char>(cp);
} else if (cp <= 0x7FF) {
if (dest + 1 >= dest_end) {
return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Destination buffer too small for UTF-8 conversion");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new undersized-buffer checks still form out-of-range pointers before returning the error in some cases. For example, if str is empty or has only one byte remaining and cp needs 3 or 4 bytes, expressions such as dest + 2 or dest + 3 go past the one-past-the-end pointer, which is undefined behavior in C++. Please compute the remaining capacity first, e.g. const auto remaining = static_cast<size_t>(dest_end - dest); if (remaining < required_bytes) ..., then write the bytes. It would also be worth adding tests for undersized 3-byte and 4-byte encodings, including an empty destination buffer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants