Add custom callable and newline sentence tokenizers #11
Merged
Conversation
Enable users to provide custom sentence tokenization logic via callables, complementing the existing built-in tokenizers (blingfire, nltk, pysbd, syntok).

Changes:
- Update get_sentence_offsets() to accept a callable in addition to string method names
- Add validation for the callable's return type (torch.Tensor of shape (n_sentences, 2))
- Update LateEncoder.encode() to accept a callable sent_tokenizer parameter
- Add comprehensive documentation with a custom tokenizer example
- Add type hints: Callable[[str], torch.Tensor]

Use case: enables domain-specific sentence segmentation for specialized text types (legal documents, code, structured data, transcripts, etc.) without modifying the core library.
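A callable satisfying the `Callable[[str], torch.Tensor]` contract could look like the following minimal sketch (the period-splitting logic is purely illustrative; only the signature and the `(n_sentences, 2)` return shape come from the description above):

```python
import torch

def period_offsets(text: str) -> torch.Tensor:
    """Toy tokenizer: split on periods and return half-open
    [start, end) character offsets, shape (n_sentences, 2)."""
    offsets = []
    start = 0
    for i, ch in enumerate(text):
        if ch == ".":
            offsets.append((start, i + 1))
            start = i + 1
            while start < len(text) and text[start] == " ":
                start += 1  # skip whitespace after the period
    if start < len(text):
        offsets.append((start, len(text)))
    return torch.tensor(offsets, dtype=torch.long)

t = period_offsets("One. Two. Three")
# each row holds the [start, end) offsets of one sentence
```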
Add 'newline' as a built-in sentence tokenizer option for line-oriented text (code, transcripts, structured data).

Changes:
- Add get_sentence_offsets_newline() function to chunk.py
- Register 'newline' in the methods dict
- Update documentation to include the 'newline' option
- Update the custom tokenizer example to use speaker-turn segmentation (a more realistic use case than newline splitting)

This makes newline splitting a first-class option alongside blingfire, nltk, pysbd, and syntok.
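The behavior described for `get_sentence_offsets_newline()` can be sketched as follows (an approximation based on the description above, not the library's exact code):

```python
import torch

def get_sentence_offsets_newline(text: str) -> torch.Tensor:
    """Sketch: split on '\\n', skip empty lines, and return
    half-open [start, end) character offsets."""
    offsets = []
    start = 0
    for line in text.split("\n"):
        end = start + len(line)
        if line.strip():                # skip empty / whitespace-only lines
            offsets.append((start, end))
        start = end + 1                 # +1 accounts for the '\n' separator
    if not offsets:
        return torch.empty((0, 2), dtype=torch.long)
    return torch.tensor(offsets, dtype=torch.long)
```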
Clarify that sentence offsets use half-open intervals [start, end), which work directly with Python slicing (text[start:end] extracts the sentence).

Changes:
- Add an explanation to the get_sentence_offsets() docstring
- Add a concrete example showing offset usage with slicing
- Update the custom tokenizer examples to document the requirement
- Clarify the encode() and tokenize_with_sentence_boundaries() docstrings

This helps users write correct custom sentence tokenizers by making the expected offset format explicit.
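A quick demonstration of how half-open `[start, end)` offsets pair with Python slicing (the sample text and offsets are illustrative):

```python
import torch

text = "First sentence. Second one."
# Two half-open [start, end) pairs; end is exclusive, so each pair
# can be used directly as a slice.
offsets = torch.tensor([[0, 15], [16, 27]], dtype=torch.long)
sentences = [text[s:e] for s, e in offsets.tolist()]
# text[0:15] -> "First sentence."   text[16:27] -> "Second one."
```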
Provides a simple wrapper around encode() that returns chunk metadata without embeddings. Useful for debugging and testing chunking strategies.
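A minimal sketch of such a wrapper. The function name, the dict-shaped return value, and its field names are all assumptions for illustration; a stub stands in for the real encoder so the sketch runs on its own:

```python
class _StubEncoder:
    """Stand-in for the real encoder so the sketch is self-contained."""
    def encode(self, text, **kwargs):
        return {"embeddings": [[0.0]], "chunks": [text],
                "offsets": [(0, len(text))]}

def chunk(encoder, text, **kwargs):
    """Hypothetical wrapper: call encode() and return only the
    chunk metadata, dropping the embeddings."""
    result = encoder.encode(text, **kwargs)
    return {k: v for k, v in result.items() if k != "embeddings"}

meta = chunk(_StubEncoder(), "hello world")
```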
This reverts commit 16be7d0.
Use a triple-quoted string to make it obvious that the example shows one document with three sentences. Eliminates confusion from implicit string concatenation (missing commas between strings).

- A single variable 'doc' emphasizes one document
- Triple quotes make the multi-line string obvious
- Each sentence on its own line for clarity
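In the style this commit describes, the example document would look something like the sketch below (the sentence text itself is illustrative):

```python
# One document, three sentences; the triple-quoted string makes the
# multi-line structure explicit and avoids implicit concatenation.
doc = """First sentence of the document.
Second sentence of the document.
Third sentence of the document."""
```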
Summary
This PR adds support for custom callable sentence tokenizers and a built-in newline tokenizer, along with improvements to the pysbd implementation.
Changes
New Features
- **Custom callable sentence tokenizers** - Users can now pass a custom function to the `sent_tokenizer` parameter that returns sentence offsets as a `torch.Tensor`. This enables domain-specific segmentation (e.g., speaker turns in transcripts, code blocks, etc.)
- **Built-in newline tokenizer** - New `sent_tokenizer="newline"` option for line-oriented text like code, transcripts, or structured data. Splits on `\n` and skips empty lines.

Improvements
- **pysbd native char_span** - Refactored `get_sentence_offsets_pysbd()` to use pysbd's native `char_span=True` feature instead of manually searching for sentences. More reliable, cleaner code, and more accurate offset detection.
- **Documentation** - Added explicit documentation that sentence offsets are half-open intervals `[start, end)` that work with Python slicing (`text[start:end]`).

Tests
Examples
Custom callable tokenizer:
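A custom speaker-turn tokenizer for transcripts might look like the sketch below (the regex, the sample transcript, and the commented `encode()` call shape are assumptions based on the description above):

```python
import re
import torch

def speaker_turn_offsets(text: str) -> torch.Tensor:
    """Treat each 'Name:' speaker turn as one 'sentence';
    returns half-open [start, end) character offsets."""
    starts = [m.start() for m in re.finditer(r"(?m)^\w+:", text)]
    if not starts:
        return torch.tensor([[0, len(text)]], dtype=torch.long)
    ends = starts[1:] + [len(text)]
    return torch.tensor(list(zip(starts, ends)), dtype=torch.long)

transcript = "Alice: Hi there.\nBob: Hello!\nAlice: How are you?"
offsets = speaker_turn_offsets(transcript)
# The callable is then passed straight to encode(), e.g.:
#   encoder.encode(docs, sent_tokenizer=speaker_turn_offsets)
```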
Newline tokenizer:
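Usage would look roughly like the commented call below (a sketch; the constructor argument is an assumption). The plain-Python split that follows shows which chunk boundaries the `"newline"` option would produce, per the rule above of splitting on `\n` and skipping empty lines:

```python
# Hedged sketch of the call shape described in this PR:
#   encoder = LateEncoder(...)
#   vecs = encoder.encode(["line one\n\nline two"], sent_tokenizer="newline")

# Equivalent splitting behavior for this document:
text = "line one\n\nline two"
lines = [ln for ln in text.split("\n") if ln.strip()]
# lines == ["line one", "line two"]  (the empty middle line is skipped)
```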
Related