Refactor and modernize StringNormalizer. by yuslepukhin · Pull Request #28320 · microsoft/onnxruntime

yuslepukhin · 2026-05-02T00:30:28Z

This pull request refactors and modernizes the UTF-8 and wide character (wchar_t) string conversion logic in the string normalizer CPU kernel, replacing deprecated and complex code with new, platform-appropriate utilities. The changes improve code maintainability, portability, and performance, especially on non-Windows platforms, by introducing custom UTF-8 conversion routines and simplifying buffer management.

The most important changes are:

UTF-8 and Wide Character Conversion Utilities:

Added new UTF-8 <-> wchar_t conversion functions (WideToUtf8RequiredSize, WideToUtf8, Utf8ToWide, and Utf8ToWideString) for non-Windows platforms in utf8_util.h, avoiding deprecated std::codecvt and providing robust Unicode handling.
Updated Utf8ConverterGeneric in string_normalizer.cc to use these new utilities, greatly simplifying the code and removing legacy/deprecated conversion logic.
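
To make the shape of such a conversion routine concrete, here is a minimal sketch of a validating UTF-8 decoder. It is not the code added in utf8_util.h: the function name `DecodeUtf8` is hypothetical, and it writes to `std::u32string` rather than `std::wstring` so that the example does not depend on the platform's `wchar_t` width.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch, not the PR's Utf8ToWide: decodes UTF-8 into UTF-32 code points.
// Returns false on malformed input (invalid lead byte, bad continuation byte,
// truncated sequence, overlong form, surrogate, or value above U+10FFFF).
// On failure, `out` holds the code points decoded before the error.
inline bool DecodeUtf8(const std::string& in, std::u32string& out) {
  out.clear();
  out.reserve(in.size());  // each code point consumes at least one input byte
  std::size_t i = 0;
  while (i < in.size()) {
    const auto b0 = static_cast<unsigned char>(in[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min_cp = 0;  // smallest value that legitimately needs `len` bytes
    if (b0 < 0x80) {
      len = 1; cp = b0; min_cp = 0;
    } else if ((b0 & 0xE0) == 0xC0) {
      len = 2; cp = b0 & 0x1F; min_cp = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
      len = 3; cp = b0 & 0x0F; min_cp = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
      len = 4; cp = b0 & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (in.size() - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const auto b = static_cast<unsigned char>(in[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    out.push_back(cp);
    i += len;
  }
  return true;
}
```

On a platform where `wchar_t` is at least 32 bits, each decoded code point would map to exactly one wide character.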

Code Simplification and Performance:

Simplified buffer size estimation for conversions: now directly uses the UTF-8 string size as an upper bound for the wide buffer, removing the need for a full decode pass just to compute buffer sizes.
Improved comments and logic for case-insensitive filtering, clarifying why lowercasing is used and how conversions are managed for efficiency. [1] [2]
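
The buffer-sizing point above rests on a simple property: every UTF-8 encoded code point occupies at least one byte, so the byte length of a UTF-8 string can never be smaller than the number of wide characters it decodes to. The sketch below illustrates this with two hypothetical helpers (`CountCodePoints` and `WideBufferUpperBound` are illustrative names, not functions from the PR), and it assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
// This is the full pass over the input that the upper bound makes unnecessary.
inline std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char c : utf8) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical helper: a safe wide-buffer capacity obtained in O(1),
// without decoding anything.
inline std::size_t WideBufferUpperBound(const std::string& utf8) {
  return utf8.size();
}
```

The bound also holds when `wchar_t` is 16 bits and the wide string is UTF-16: a 4-byte UTF-8 sequence becomes two UTF-16 code units, which is still no more than its four input bytes. The cost of the shortcut is that the buffer may be larger than strictly needed for non-ASCII text.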

Cleanup and Modernization:

Removed all usage of deprecated std::codecvt and related workaround code, as well as unnecessary includes and platform-specific handling, resulting in cleaner and more maintainable code. [1] [2] [3]

These changes collectively modernize the string normalization kernel, improve portability, and make the codebase easier to maintain.

Copilot

Pull request overview

This PR modernizes the CPU StringNormalizer kernel’s UTF-8 / wchar_t conversion path by replacing deprecated/complex conversion logic with shared UTF-8 utilities, simplifying buffer sizing/management, and adding targeted tests to improve coverage.

Changes:

Add non-Windows UTF-8 ↔ wchar_t conversion helpers to core/common/utf8_util.h.
Refactor string_normalizer.cc to use the shared utilities and simplify wide-buffer sizing/allocation logic.
Expand unit tests for both UTF-8 utilities and StringNormalizer edge cases (multilingual, filtering, shapes, invalid inputs).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/common/utf8_util.h | Adds new UTF-8↔wide conversion helpers and tweaks `utf8_bytes` bit tests. |
| onnxruntime/core/providers/cpu/text/string_normalizer.cc | Switches generic conversion to `utf8_util` helpers and gates wide-buffer sizing/allocation. |
| onnxruntime/core/providers/cpu/text/string_normalizer.h | Updates rationale comments for case-insensitive comparison strategy. |
| onnxruntime/test/common/utf8_util_test.cc | Adds broad unit test coverage for UTF-8 helpers and conversions. |
| onnxruntime/test/providers/cpu/text/string_normalizer_test.cc | Adds many coverage-focused tests for filtering, casing, shapes, and locale/error handling. |

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

tianleiwu · 2026-05-05T00:56:53Z

+              "Non-Windows UTF-8/wchar_t conversion helpers require wchar_t to be at least 32 bits.");
+
+/// Compute the number of UTF-8 bytes required to encode a wide string.
+inline size_t WideToUtf8RequiredSize(const std::wstring& wstr) {


Why not use the standard Windows APIs, such as WideCharToMultiByte and MultiByteToWideChar?

tianleiwu

I found one remaining issue in the new UTF-8 conversion helper. The earlier shape-test concern is already covered in an existing thread, so I’m not duplicating it here.

tianleiwu · 2026-05-05T01:02:33Z

+      *dest++ = static_cast<char>(cp);
+    } else if (cp <= 0x7FF) {
+      if (dest + 1 >= dest_end) {
+        return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Destination buffer too small for UTF-8 conversion");


The new undersized-buffer checks still form out-of-range pointers before returning the error in some cases. For example, if str is empty or has only one byte remaining and cp needs 3 or 4 bytes, expressions such as dest + 2 or dest + 3 go past the one-past-the-end pointer, which is undefined behavior in C++. Please compute the remaining capacity first, e.g. const auto remaining = static_cast<size_t>(dest_end - dest); if (remaining < required_bytes) ..., then write the bytes. It would also be worth adding tests for undersized 3-byte and 4-byte encodings, including an empty destination buffer.
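
One way to read the suggestion above is sketched below: determine how many bytes the code point needs, compare that against the remaining capacity computed by pointer subtraction, and only then write. This is an illustration under stated assumptions, not the PR's implementation: `Utf8EncodedSize` and `EncodeUtf8Checked` are hypothetical names, and the sketch returns a byte count instead of the `ORT_MAKE_STATUS` error used in the quoted diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of UTF-8 bytes needed for a code point,
// or 0 if the value is a surrogate or lies above U+10FFFF.
inline std::size_t Utf8EncodedSize(std::uint32_t cp) {
  if (cp <= 0x7F) return 1;
  if (cp <= 0x7FF) return 2;
  if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
  if (cp <= 0xFFFF) return 3;
  if (cp <= 0x10FFFF) return 4;
  return 0;
}

// Hypothetical helper: encodes cp into [dest, dest_end).
// Returns the number of bytes written, or 0 if cp is invalid or does not fit.
// No pointer beyond dest_end is ever formed: the only pointer arithmetic before
// the capacity check is dest_end - dest, which is valid within one buffer.
inline std::size_t EncodeUtf8Checked(std::uint32_t cp, char* dest, const char* dest_end) {
  const std::size_t required = Utf8EncodedSize(cp);
  if (required == 0) return 0;
  assert(dest <= dest_end);
  const auto remaining = static_cast<std::size_t>(dest_end - dest);
  if (remaining < required) return 0;
  switch (required) {
    case 1:
      dest[0] = static_cast<char>(cp);
      break;
    case 2:
      dest[0] = static_cast<char>(0xC0 | (cp >> 6));
      dest[1] = static_cast<char>(0x80 | (cp & 0x3F));
      break;
    case 3:
      dest[0] = static_cast<char>(0xE0 | (cp >> 12));
      dest[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      dest[2] = static_cast<char>(0x80 | (cp & 0x3F));
      break;
    default:  // required == 4
      dest[0] = static_cast<char>(0xF0 | (cp >> 18));
      dest[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      dest[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      dest[3] = static_cast<char>(0x80 | (cp & 0x3F));
      break;
  }
  return required;
}
```

With this shape, an empty destination (`dest == dest_end`) and undersized 3-byte or 4-byte cases all take the same early-return path, which also makes them straightforward to cover in tests.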

yuslepukhin added 3 commits April 30, 2026 17:08

Address string normalizer inefficiencies

dd1d6cd

Remove caching as it bring allocations back

b5fda00

Fix tests

275d895

yuslepukhin requested a review from Copilot May 2, 2026 00:30

Copilot started reviewing on behalf of yuslepukhin May 2, 2026 00:31

Copilot AI reviewed May 2, 2026

Address review comments

213fb4c

yuslepukhin requested a review from Copilot May 2, 2026 00:57

Copilot started reviewing on behalf of yuslepukhin May 2, 2026 00:58

Copilot AI reviewed May 2, 2026

Address review comments and CI failures

8e2da6e

yuslepukhin requested a review from Copilot May 4, 2026 18:25

Address stylistic thing

3c1b26e

Copilot AI reviewed May 4, 2026

Copilot started reviewing on behalf of yuslepukhin May 4, 2026 19:06

Address issues

aaa451e

yuslepukhin requested review from Copilot and tianleiwu May 4, 2026 19:44

Copilot started reviewing on behalf of yuslepukhin May 4, 2026 21:41

Copilot AI reviewed May 4, 2026

Comment thread onnxruntime/core/providers/cpu/text/string_normalizer.cc

tianleiwu reviewed May 5, 2026

3 participants