
Add support for actively maintained sentence tokenizers (spaCy, Stanza, wtpsplit) #15

@429er

Description

Motivation

As documented in #13, the current default sentence tokenizer (BlingFire) is unmaintained, and several of our optional tokenizers (syntok, pysbd) are also abandoned. While we're switching the default to NLTK Punkt, it would be valuable to support additional actively maintained alternatives, giving users more options and reducing reliance on any single package.

Proposed Solution

Add optional support for three actively maintained sentence tokenizers:

  1. spaCy - Industrial-strength NLP library with robust sentence boundary detection

    • Status: Actively maintained
    • Performance: Fast (Cython-optimized)
    • Dependencies: Requires spaCy + language model download
  2. Stanza - Stanford NLP's neural pipeline

    • Status: Actively maintained (v1.11.0, October 2025)
    • Performance: Accurate but heavier than rule-based alternatives
    • Dependencies: Requires PyTorch (already a dependency) + model downloads
  3. wtpsplit - ML-based SOTA sentence tokenizer

    • Status: Actively maintained
    • Performance: Best accuracy but slowest (12-450x slower than BlingFire depending on backend)
    • Dependencies: Requires PyTorch/ONNX Runtime

Implementation

Following the existing pattern in afterthoughts/chunk.py:

import torch

def get_sentence_offsets_spacy(text: str) -> torch.Tensor:
    """Extract sentence offsets using spaCy."""
    nlp = require_spacy()  # Lazy import helper
    doc = nlp(text)
    # doc.sents requires a component that sets sentence boundaries
    # (parser, senter, or sentencizer), so the helper must load one.
    offsets = torch.tensor([(sent.start_char, sent.end_char) for sent in doc.sents])
    return offsets

def get_sentence_offsets_stanza(text: str) -> torch.Tensor:
    """Extract sentence offsets using Stanza."""
    nlp = require_stanza()  # Lazy import helper (tokenize-only pipeline)
    doc = nlp(text)
    # Stanza exposes character offsets on tokens rather than sentences,
    # so take the span from each sentence's first and last token.
    offsets = torch.tensor(
        [(s.tokens[0].start_char, s.tokens[-1].end_char) for s in doc.sentences]
    )
    return offsets

def get_sentence_offsets_wtpsplit(text: str) -> torch.Tensor:
    """Extract sentence offsets using wtpsplit."""
    model = require_wtpsplit()  # Lazy import helper
    # Sketch; details TBD. wtpsplit's split() returns sentence strings,
    # not offsets, so map each sentence back to its character span.
    offsets, cursor = [], 0
    for sent in model.split(text):
        start = text.find(sent, cursor)
        cursor = start + len(sent)
        offsets.append((start, cursor))
    return torch.tensor(offsets)

Add to pyproject.toml:

[project.optional-dependencies]
spacy = ["spacy>=3.0"]
stanza = ["stanza>=1.11"]
wtpsplit-ort-gpu = ["wtpsplit[ort-gpu]"]
wtpsplit-ort-cpu = ["wtpsplit[ort-cpu]"]
wtpsplit = ["wtpsplit"]
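Since each backend is an optional extra, users opt in per backend, e.g. (assuming the package is published as `afterthoughts`):

```shell
# Install only the backend you need
pip install "afterthoughts[spacy]"
pip install "afterthoughts[wtpsplit-ort-cpu]"
```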

Update sent_tokenizer parameter to accept: "blingfire", "nltk", "pysbd", "syntok", "spacy", "stanza", "wtpsplit"
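A name-to-function dispatch for the extended `sent_tokenizer` choices could be sketched as follows (the function and registry names here are hypothetical, not from the codebase):

```python
from typing import Callable

def resolve_sent_tokenizer(
    name: str, registry: dict[str, Callable]
) -> Callable:
    """Look up the offset function registered under `name`, raising a
    ValueError that lists the supported choices on an unknown name."""
    try:
        return registry[name]
    except KeyError:
        choices = ", ".join(sorted(registry))
        raise ValueError(
            f"unknown sent_tokenizer {name!r}; expected one of: {choices}"
        ) from None
```

The real registry would map "blingfire", "nltk", "pysbd", "syntok", "spacy", "stanza", and "wtpsplit" to their `get_sentence_offsets_*` functions, so each backend is only imported when selected.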

Tradeoffs

  • spaCy: Requires model downloads but provides robust sentence detection with broader NLP capabilities
  • Stanza: Heavy footprint but state-of-the-art accuracy for some languages
  • wtpsplit: Best accuracy but significant performance penalty (acceptable for some use cases)

These are all optional dependencies, so users only install what they need. Having actively maintained alternatives reduces long-term maintenance risk and gives users flexibility to choose based on their specific accuracy/speed requirements.

Labels

enhancement (New feature or request)