
Add custom callable and newline sentence tokenizers #11

Merged

429er merged 8 commits into main from feature/custom-sentence-tokenizer on Feb 1, 2026

Conversation

429er (Owner) commented on Feb 1, 2026

Summary

This PR adds support for custom callable sentence tokenizers and a built-in newline tokenizer, along with improvements to the pysbd implementation.

Changes

New Features

  • Custom callable sentence tokenizers - Users can now pass a custom function to the sent_tokenizer parameter that returns sentence offsets as a torch.Tensor. This enables domain-specific segmentation (e.g., speaker turns in transcripts, code blocks, etc.).

  • Built-in newline tokenizer - New sent_tokenizer="newline" option for line-oriented text like code, transcripts, or structured data. Splits on \n and skips empty lines (see the sketch below).
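
For illustration, here is a minimal sketch of the newline offset computation. The helper name is hypothetical; the actual get_sentence_offsets_newline() in chunk.py may differ in details:

import torch

def newline_offsets(text: str) -> torch.Tensor:
    # Sketch only: walk the lines, tracking character positions.
    offsets = []
    start = 0
    for line in text.split("\n"):
        end = start + len(line)
        if line.strip():  # skip empty (and whitespace-only) lines
            offsets.append([start, end])
        start = end + 1  # +1 for the "\n" consumed by split()
    if not offsets:
        offsets = [[0, len(text)]]  # degenerate case: no non-empty lines
    return torch.tensor(offsets).reshape(-1, 2)

Offsets follow the half-open [start, end) convention documented below, so text[start:end] recovers each line.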

Improvements

  • pysbd native char_span - Refactored get_sentence_offsets_pysbd() to use pysbd's native char_span=True feature instead of manually searching for each sentence. This is more reliable, yields cleaner code, and detects offsets more accurately (see the example after this list).

  • Documentation - Added explicit documentation that sentence offsets are half-open intervals [start, end) that work with Python slicing (text[start:end]).
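
To make both points concrete, here is a small example using pysbd's char_span mode together with the half-open slicing contract (variable names are illustrative):

import pysbd
import torch

text = "First sentence. Second sentence."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
spans = seg.segment(text)  # char_span=True yields TextSpan objects with .sent, .start, .end
offsets = torch.tensor([[s.start, s.end] for s in spans]).reshape(-1, 2)

# Half-open intervals compose directly with Python slicing.
start, end = offsets[0].tolist()
assert text[start:end].strip() == "First sentence."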

Tests

  • Added comprehensive tests for newline tokenizer (basic usage and empty line handling)
  • Added tests for custom callable tokenizers (basic usage and error handling)
  • All 219 tests pass

Examples

Custom callable tokenizer:

import re
import torch

def speaker_turns(text: str) -> torch.Tensor:
    """Split a transcript into speaker turns, returning half-open [start, end) offsets."""
    # Match speaker labels such as "Speaker A:" or "Alice:" at the start of a line.
    pattern = r'(?:^|\n)(?:Speaker [A-Z]:|\w+:)'
    matches = list(re.finditer(pattern, text))
    if not matches:
        # No speaker labels found: treat the whole text as a single segment.
        return torch.tensor([[0, len(text)]]).reshape(-1, 2)
    offsets = []
    for i in range(len(matches)):
        start = matches[i].start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        offsets.append([start, end])
    return torch.tensor(offsets).reshape(-1, 2)

df, embeds = encoder.encode(transcripts, sent_tokenizer=speaker_turns)
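
Per the commit notes, the return value of a custom callable is validated as a torch.Tensor of shape (n_sentences, 2). A plausible sketch of such a check (hypothetical helper; the actual validation in chunk.py may differ):

import torch

def validate_offsets(offsets) -> torch.Tensor:
    # Sketch of the return-type/shape validation described in the commit messages.
    if not isinstance(offsets, torch.Tensor):
        raise TypeError("custom sent_tokenizer must return a torch.Tensor")
    if offsets.ndim != 2 or offsets.shape[1] != 2:
        raise ValueError(
            f"expected offsets of shape (n_sentences, 2), got {tuple(offsets.shape)}"
        )
    return offsets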

Newline tokenizer:

# For code or line-oriented text
df, embeds = encoder.encode(code_snippets, sent_tokenizer="newline")

Related

  • Closes issue about custom sentence tokenizers (if one exists)
  • Enables use cases like speaker-based chunking, code chunking, and custom domain segmentation

429er added 8 commits on February 1, 2026 at 17:35

Enable users to provide custom sentence tokenization logic via callables,
complementing the existing built-in tokenizers (blingfire, nltk, pysbd, syntok).

Changes:
- Update get_sentence_offsets() to accept callable in addition to string method names
- Add validation for callable return type (torch.Tensor of shape (n_sentences, 2))
- Update LateEncoder.encode() to accept callable sent_tokenizer parameter
- Add comprehensive documentation with custom tokenizer example
- Add type hints: Callable[[str], torch.Tensor]

Use case: Enables domain-specific sentence segmentation for specialized
text types (legal documents, code, structured data, transcripts, etc.)
without modifying the core library.

Add 'newline' as a built-in sentence tokenizer option for line-oriented
text (code, transcripts, structured data).

Changes:
- Add get_sentence_offsets_newline() function to chunk.py
- Register 'newline' in methods dict
- Update documentation to include 'newline' option
- Update custom tokenizer example to use speaker-turn segmentation
  (more realistic use case than newline splitting)

This makes newline splitting a first-class option alongside blingfire,
nltk, pysbd, and syntok.

Clarify that sentence offsets use half-open intervals [start, end) which
work directly with Python slicing (text[start:end] extracts the sentence).

Changes:
- Add explanation to get_sentence_offsets() docstring
- Add concrete example showing offset usage with slicing
- Update custom tokenizer examples to document the requirement
- Clarify in encode() and tokenize_with_sentence_boundaries() docstrings

This helps users write correct custom sentence tokenizers by making the
expected offset format explicit.

Provides a simple wrapper around encode() that returns chunk metadata
without embeddings. Useful for debugging and testing chunking strategies.

Use triple-quoted string to make it obvious that the example shows one
document with three sentences. Eliminates confusion from implicit string
concatenation (missing commas between strings).

- Single variable 'doc' emphasizes one document
- Triple-quotes make multi-line string obvious
- Each sentence on its own line for clarity
429er merged commit acf6f3e into main on Feb 1, 2026
6 checks passed