Motivation
As documented in #13, the current default sentence tokenizer (BlingFire) is unmaintained, and several of our optional tokenizers (syntok, pysbd) are also abandoned. While we're switching the default to NLTK Punkt, it would be valuable to support additional actively maintained alternatives to give users more options and reduce dependency on any single unmaintained package.
Proposed Solution
Add optional support for three actively maintained sentence tokenizers:
- spaCy - Industrial-strength NLP library with robust sentence boundary detection
  - Status: Actively maintained
  - Performance: Fast (Cython-optimized)
  - Dependencies: Requires spaCy + language model download
- Stanza - Stanford NLP's neural pipeline
  - Status: Actively maintained (v1.11.0, October 2025)
  - Performance: Accurate but heavier than rule-based alternatives
  - Dependencies: Requires PyTorch (already a dependency) + model downloads
- wtpsplit - ML-based SOTA sentence tokenizer
  - Status: Actively maintained
  - Performance: Best accuracy but slowest (12-450x slower than BlingFire depending on backend)
  - Dependencies: Requires PyTorch/ONNX Runtime
Implementation
Following the existing pattern in `afterthoughts/chunk.py`:

```python
import torch


def get_sentence_offsets_spacy(text: str) -> torch.Tensor:
    """Extract sentence offsets using spaCy."""
    nlp = require_spacy()  # Lazy import helper
    doc = nlp(text)
    offsets = torch.tensor([(sent.start_char, sent.end_char) for sent in doc.sents])
    return offsets


def get_sentence_offsets_stanza(text: str) -> torch.Tensor:
    """Extract sentence offsets using Stanza."""
    nlp = require_stanza()  # Lazy import helper
    doc = nlp(text)
    offsets = torch.tensor([(sent.start_char, sent.end_char) for sent in doc.sentences])
    return offsets


def get_sentence_offsets_wtpsplit(text: str) -> torch.Tensor:
    """Extract sentence offsets using wtpsplit."""
    model = require_wtpsplit()  # Lazy import helper
    # Implementation details TBD
    ...
```
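One wrinkle for the TBD wtpsplit body: wtpsplit's split APIs return sentence strings rather than character offsets, so the segments would need to be mapped back to `(start_char, end_char)` spans. A minimal sketch of that conversion, assuming the segments appear in order and concatenate back to the input text (worth verifying against wtpsplit's actual output before relying on it; the helper name is illustrative):

```python
def segments_to_offsets(text: str, segments: list[str]) -> list[tuple[int, int]]:
    """Map sentence strings back to (start_char, end_char) spans in `text`.

    Assumes segments occur in order and jointly cover `text` without
    reordering; raises ValueError if a segment cannot be located.
    """
    offsets: list[tuple[int, int]] = []
    pos = 0
    for seg in segments:
        start = text.index(seg, pos)  # ValueError if the segment is missing
        end = start + len(seg)
        offsets.append((start, end))
        pos = end
    return offsets
```

The resulting pairs could then be wrapped in `torch.tensor(...)` to match the other backends' return type.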
Add to `pyproject.toml`:

```toml
[project.optional-dependencies]
spacy = ["spacy>=3.0"]
stanza = ["stanza>=1.11"]
wtpsplit-ort-gpu = ["wtpsplit[ort-gpu]"]
wtpsplit-ort-cpu = ["wtpsplit[ort-cpu]"]
wtpsplit = ["wtpsplit"]
```
Update the `sent_tokenizer` parameter to accept: `"blingfire"`, `"nltk"`, `"pysbd"`, `"syntok"`, `"spacy"`, `"stanza"`, `"wtpsplit"`.
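Dispatch on the parameter could reuse a simple registry keyed by name. A sketch with a toy `"naive"` splitter standing in for the real backends (all names here are hypothetical, not existing code; real entries would point at the `get_sentence_offsets_*` functions):

```python
from typing import Callable


def _split_naive(text: str) -> list[tuple[int, int]]:
    """Toy splitter: ends a sentence at every '.', for illustration only."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch == ".":
            spans.append((start, i + 1))
            start = i + 2  # skip the '.' and one following space
    return spans


_SENT_TOKENIZERS: dict[str, Callable[[str], list[tuple[int, int]]]] = {
    "naive": _split_naive,
}


def get_sentence_offsets(text: str, sent_tokenizer: str = "naive"):
    """Look up the requested backend, with a clear error for unknown names."""
    try:
        fn = _SENT_TOKENIZERS[sent_tokenizer]
    except KeyError:
        raise ValueError(
            f"Unknown sent_tokenizer {sent_tokenizer!r}; "
            f"expected one of {sorted(_SENT_TOKENIZERS)}"
        ) from None
    return fn(text)
```

A registry like this keeps the lazy-import cost inside each backend function, so merely listing the available names stays cheap.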
Tradeoffs
- spaCy: Requires model downloads but provides robust sentence detection with broader NLP capabilities
- Stanza: Heavy footprint but state-of-the-art accuracy for some languages
- wtpsplit: Best accuracy but significant performance penalty (acceptable for some use cases)
These are all optional dependencies, so users only install what they need. Having actively maintained alternatives reduces long-term maintenance risk and gives users flexibility to choose based on their specific accuracy/speed requirements.
Related