Add support for Chinese sentence boundary character (。) #48
Open
jjroelofs wants to merge 1 commit into Tessmore:master
- Update Match.js and sbd.js to recognize the Chinese full stop (。) as a sentence boundary
- Modify the isConcatenated function to handle Chinese characters
- Add test cases for Chinese sentences
- Update the README to mention Chinese sentence boundary support

This change improves the library's ability to correctly tokenize Chinese text and mixed Chinese-English text into sentences.
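To illustrate the behavior described above, here is a minimal standalone sketch. It is not the code from Match.js or sbd.js; the `TERMINATORS` set and the `splitSentences` function are hypothetical names chosen for this example. It assumes that ASCII terminators end a sentence only before whitespace or end of input, while 。 always ends one, because Chinese text is normally written without spaces between sentences.

```javascript
// Hypothetical sketch, NOT the library's actual implementation.
// Characters that may end a sentence, including the Chinese full stop.
const TERMINATORS = new Set([".", "!", "?", "。"]);

function splitSentences(text) {
  const sentences = [];
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  let current = "";

  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    current += ch;
    if (!TERMINATORS.has(ch)) continue;

    const next = chars[i + 1];
    // Assumption: "." inside "3.14" is not a boundary, so ASCII terminators
    // need trailing whitespace or end of input. "。" needs neither.
    const isBoundary = ch === "。" || next === undefined || /\s/.test(next);
    if (isBoundary) {
      const trimmed = current.trim();
      if (trimmed) sentences.push(trimmed);
      current = "";
    }
  }

  // Keep any trailing text that has no terminator.
  const rest = current.trim();
  if (rest) sentences.push(rest);
  return sentences;
}

console.log(splitSentences("你好。我是学生。"));
console.log(splitSentences("Hello world. 这是中文。This is English."));
```

The first call yields two Chinese sentences and the second yields three sentences in mixed text. A real implementation also has to deal with abbreviations, quotes, and ellipses, which this sketch ignores.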
I understand there is some conflict between this PR and the Hindi support PR. Let me know if you have a better idea for making the plugin support multiple scripts in a scalable way.
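One possible direction, offered only as a sketch: keep terminators in a per-script table and build the boundary matcher from whichever scripts are enabled, so each new script is a data change instead of a code change. The `SCRIPT_TERMINATORS` table and `buildBoundaryPattern` function below are hypothetical and not part of the library. The Devanagari entry assumes the Hindi PR concerns the danda (।), which this thread does not confirm.

```javascript
// Hypothetical registry; script names and layout are illustrative only.
const SCRIPT_TERMINATORS = {
  latin: [".", "!", "?"],
  chinese: ["。"],
  devanagari: ["।"], // assumption: the Hindi PR targets the danda
};

// Build one regex that matches any terminator from the enabled scripts.
function buildBoundaryPattern(scripts) {
  const chars = scripts.flatMap((name) => {
    const list = SCRIPT_TERMINATORS[name];
    if (!list) throw new Error(`Unknown script: ${name}`);
    return list;
  });
  // Escape the characters that are special inside a character class.
  const escaped = chars.map((c) => c.replace(/[\\\]^-]/g, "\\$&")).join("");
  return new RegExp(`[${escaped}]`, "u");
}

const boundary = buildBoundaryPattern(["latin", "chinese"]);
console.log(boundary.test("。")); // true
console.log(boundary.test("।")); // false, devanagari is not enabled here
```

With a table like this, the two PRs would each add one entry instead of editing the same matching logic, which should reduce merge conflicts.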