Refactor and modernize StringNormalizer.#28320
Conversation
There was a problem hiding this comment.
Pull request overview
This PR modernizes the CPU StringNormalizer kernel’s UTF-8 / wchar_t conversion path by replacing deprecated/complex conversion logic with shared UTF-8 utilities, simplifying buffer sizing/management, and adding targeted tests to improve coverage.
Changes:
- Add non-Windows UTF-8 ↔
wchar_tconversion helpers tocore/common/utf8_util.h. - Refactor
string_normalizer.ccto use the shared utilities and simplify wide-buffer sizing/allocation logic. - Expand unit tests for both UTF-8 utilities and
StringNormalizeredge cases (multilingual, filtering, shapes, invalid inputs).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/common/utf8_util.h | Adds new UTF-8↔wide conversion helpers and tweaks utf8_bytes bit tests. |
| onnxruntime/core/providers/cpu/text/string_normalizer.cc | Switches generic conversion to utf8_util helpers and gates wide-buffer sizing/allocation. |
| onnxruntime/core/providers/cpu/text/string_normalizer.h | Updates rationale comments for case-insensitive comparison strategy. |
| onnxruntime/test/common/utf8_util_test.cc | Adds broad unit test coverage for UTF-8 helpers and conversions. |
| onnxruntime/test/providers/cpu/text/string_normalizer_test.cc | Adds many coverage-focused tests for filtering, casing, shapes, and locale/error handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "Non-Windows UTF-8/wchar_t conversion helpers require wchar_t to be at least 32 bits."); | ||
|
|
||
| /// Compute the number of UTF-8 bytes required to encode a wide string. | ||
| inline size_t WideToUtf8RequiredSize(const std::wstring& wstr) { |
There was a problem hiding this comment.
Why not use standard windows APIs: WideCharToMultiByte and MultiByteToWideChar etc?
tianleiwu
left a comment
There was a problem hiding this comment.
I found one remaining issue in the new UTF-8 conversion helper. The earlier shape-test concern is already covered in an existing thread, so I’m not duplicating it here.
| *dest++ = static_cast<char>(cp); | ||
| } else if (cp <= 0x7FF) { | ||
| if (dest + 1 >= dest_end) { | ||
| return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Destination buffer too small for UTF-8 conversion"); |
There was a problem hiding this comment.
The new undersized-buffer checks still form out-of-range pointers before returning the error in some cases. For example, if str is empty or has only one byte remaining and cp needs 3 or 4 bytes, expressions such as dest + 2 or dest + 3 go past the one-past-the-end pointer, which is undefined behavior in C++. Please compute the remaining capacity first, e.g. const auto remaining = static_cast<size_t>(dest_end - dest); if (remaining < required_bytes) ..., then write the bytes. It would also be worth adding tests for undersized 3-byte and 4-byte encodings, including an empty destination buffer.
This pull request refactors and modernizes the UTF-8 and wide character (wchar_t) string conversion logic in the string normalizer CPU kernel, replacing deprecated and complex code with new, platform-appropriate utilities. The changes improve code maintainability, portability, and performance, especially on non-Windows platforms, by introducing custom UTF-8 conversion routines and simplifying buffer management.
The most important changes are:
UTF-8 and Wide Character Conversion Utilities:
WideToUtf8RequiredSize,WideToUtf8,Utf8ToWide, andUtf8ToWideString) for non-Windows platforms inutf8_util.h, avoiding deprecatedstd::codecvtand providing robust Unicode handling.Utf8ConverterGenericinstring_normalizer.ccto use these new utilities, greatly simplifying the code and removing legacy/deprecated conversion logic.Code Simplification and Performance:
Cleanup and Modernization:
std::codecvtand related workaround code, as well as unnecessary includes and platform-specific handling, resulting in cleaner and more maintainable code. [1] [2] [3]These changes collectively modernize the string normalization kernel, improve portability, and make the codebase easier to maintain.