
Add custom callable and newline sentence tokenizers #11

Merged

429er merged 8 commits into main from feature/custom-sentence-tokenizer on Feb 1, 2026

Conversation

429er (Owner) commented on Feb 1, 2026

Summary

This PR adds support for custom callable sentence tokenizers and a built-in newline tokenizer, along with improvements to the pysbd implementation.

Changes

New Features

  • Custom callable sentence tokenizers - Users can now pass a custom function to the sent_tokenizer parameter that returns sentence offsets as a torch.Tensor. This enables domain-specific segmentation (e.g., speaker turns in transcripts, code blocks, etc.).

  • Built-in newline tokenizer - New sent_tokenizer="newline" option for line-oriented text like code, transcripts, or structured data. Splits on \n and skips empty lines (see the sketch below).
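
For illustration, here is a minimal sketch of the newline offset computation. The helper name is hypothetical; the actual get_sentence_offsets_newline() in chunk.py may differ in details:

import torch

def newline_offsets(text: str) -> torch.Tensor:
    # Sketch only: walk the lines, tracking character positions.
    offsets = []
    start = 0
    for line in text.split("\n"):
        end = start + len(line)
        if line.strip():  # skip empty (and whitespace-only) lines
            offsets.append([start, end])
        start = end + 1  # +1 for the "\n" consumed by split()
    if not offsets:
        offsets = [[0, len(text)]]  # degenerate case: no non-empty lines
    return torch.tensor(offsets).reshape(-1, 2)

Offsets follow the half-open [start, end) convention documented below, so text[start:end] recovers each line.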

Improvements

  • pysbd native char_span - Refactored get_sentence_offsets_pysbd() to use pysbd's native char_span=True feature instead of manually searching for each sentence. This is more reliable, yields cleaner code, and detects offsets more accurately (see the example after this list).

  • Documentation - Added explicit documentation that sentence offsets are half-open intervals [start, end) that work with Python slicing (text[start:end]).
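
To make both points concrete, here is a small example using pysbd's char_span mode together with the half-open slicing contract (variable names are illustrative):

import pysbd
import torch

text = "First sentence. Second sentence."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
spans = seg.segment(text)  # char_span=True yields TextSpan objects with .sent, .start, .end
offsets = torch.tensor([[s.start, s.end] for s in spans]).reshape(-1, 2)

# Half-open intervals compose directly with Python slicing.
start, end = offsets[0].tolist()
assert text[start:end].strip() == "First sentence."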

Tests

  • Added comprehensive tests for newline tokenizer (basic usage and empty line handling)
  • Added tests for custom callable tokenizers (basic usage and error handling)
  • All 219 tests pass

Examples

Custom callable tokenizer:

import re
import torch

def speaker_turns(text: str) -> torch.Tensor:
    """Split a transcript into speaker turns, returning half-open [start, end) offsets."""
    # Match speaker labels such as "Speaker A:" or "Alice:" at the start of a line.
    pattern = r'(?:^|\n)(?:Speaker [A-Z]:|\w+:)'
    matches = list(re.finditer(pattern, text))
    if not matches:
        # No speaker labels found: treat the whole text as a single segment.
        return torch.tensor([[0, len(text)]]).reshape(-1, 2)
    offsets = []
    for i in range(len(matches)):
        start = matches[i].start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        offsets.append([start, end])
    return torch.tensor(offsets).reshape(-1, 2)

df, embeds = encoder.encode(transcripts, sent_tokenizer=speaker_turns)
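
Per the commit notes, the return value of a custom callable is validated as a torch.Tensor of shape (n_sentences, 2). A plausible sketch of such a check (hypothetical helper; the actual validation in chunk.py may differ):

import torch

def validate_offsets(offsets) -> torch.Tensor:
    # Sketch of the return-type/shape validation described in the commit messages.
    if not isinstance(offsets, torch.Tensor):
        raise TypeError("custom sent_tokenizer must return a torch.Tensor")
    if offsets.ndim != 2 or offsets.shape[1] != 2:
        raise ValueError(
            f"expected offsets of shape (n_sentences, 2), got {tuple(offsets.shape)}"
        )
    return offsets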

Newline tokenizer:

# For code or line-oriented text
df, embeds = encoder.encode(code_snippets, sent_tokenizer="newline")

Related

  • Closes issue about custom sentence tokenizers (if one exists)
  • Enables use cases like speaker-based chunking, code chunking, and custom domain segmentation

429er added 8 commits on February 1, 2026 at 17:35

Enable users to provide custom sentence tokenization logic via callables,
complementing the existing built-in tokenizers (blingfire, nltk, pysbd, syntok).

Changes:
- Update get_sentence_offsets() to accept callable in addition to string method names
- Add validation for callable return type (torch.Tensor of shape (n_sentences, 2))
- Update LateEncoder.encode() to accept callable sent_tokenizer parameter
- Add comprehensive documentation with custom tokenizer example
- Add type hints: Callable[[str], torch.Tensor]

Use case: Enables domain-specific sentence segmentation for specialized
text types (legal documents, code, structured data, transcripts, etc.)
without modifying the core library.

Add 'newline' as a built-in sentence tokenizer option for line-oriented
text (code, transcripts, structured data).

Changes:
- Add get_sentence_offsets_newline() function to chunk.py
- Register 'newline' in methods dict
- Update documentation to include 'newline' option
- Update custom tokenizer example to use speaker-turn segmentation
  (more realistic use case than newline splitting)

This makes newline splitting a first-class option alongside blingfire,
nltk, pysbd, and syntok.

Clarify that sentence offsets use half-open intervals [start, end) which
work directly with Python slicing (text[start:end] extracts the sentence).

Changes:
- Add explanation to get_sentence_offsets() docstring
- Add concrete example showing offset usage with slicing
- Update custom tokenizer examples to document the requirement
- Clarify in encode() and tokenize_with_sentence_boundaries() docstrings

This helps users write correct custom sentence tokenizers by making the
expected offset format explicit.

Provides a simple wrapper around encode() that returns chunk metadata
without embeddings. Useful for debugging and testing chunking strategies.

Use triple-quoted string to make it obvious that the example shows one
document with three sentences. Eliminates confusion from implicit string
concatenation (missing commas between strings).

- Single variable 'doc' emphasizes one document
- Triple-quotes make multi-line string obvious
- Each sentence on its own line for clarity
429er merged commit acf6f3e into main on Feb 1, 2026
6 checks passed