
Add support for actively maintained sentence tokenizers (spaCy, Stanza, wtpsplit) #15

@429er

Description

Motivation

As documented in #13, the current default sentence tokenizer (BlingFire) is unmaintained, and several of our optional tokenizers (syntok, pysbd) are also abandoned. While we're switching the default to NLTK Punkt, it would be valuable to support additional actively maintained alternatives, giving users more options and reducing reliance on any single package.

Proposed Solution

Add optional support for three actively maintained sentence tokenizers:

  1. spaCy - Industrial-strength NLP library with robust sentence boundary detection

    • Status: Actively maintained
    • Performance: Fast (Cython-optimized)
    • Dependencies: Requires spaCy + language model download
  2. Stanza - Stanford NLP's neural pipeline

    • Status: Actively maintained (v1.11.0, October 2025)
    • Performance: Accurate but heavier than rule-based alternatives
    • Dependencies: Requires PyTorch (already a dependency) + model downloads
  3. wtpsplit - ML-based SOTA sentence tokenizer

    • Status: Actively maintained
    • Performance: Best accuracy but slowest (12-450x slower than BlingFire depending on backend)
    • Dependencies: Requires PyTorch/ONNX Runtime

Implementation

Following the existing pattern in afterthoughts/chunk.py:

import torch

def get_sentence_offsets_spacy(text: str) -> torch.Tensor:
    """Extract sentence offsets using spaCy."""
    nlp = require_spacy()  # Lazy import helper
    doc = nlp(text)
    # doc.sents requires a component that sets sentence boundaries
    # (parser, senter, or sentencizer), so the helper must load one.
    offsets = torch.tensor([(sent.start_char, sent.end_char) for sent in doc.sents])
    return offsets

def get_sentence_offsets_stanza(text: str) -> torch.Tensor:
    """Extract sentence offsets using Stanza."""
    nlp = require_stanza()  # Lazy import helper (tokenize-only pipeline)
    doc = nlp(text)
    # Stanza exposes character offsets on tokens rather than sentences,
    # so take the span from each sentence's first and last token.
    offsets = torch.tensor(
        [(s.tokens[0].start_char, s.tokens[-1].end_char) for s in doc.sentences]
    )
    return offsets

def get_sentence_offsets_wtpsplit(text: str) -> torch.Tensor:
    """Extract sentence offsets using wtpsplit."""
    model = require_wtpsplit()  # Lazy import helper
    # Sketch; details TBD. wtpsplit's split() returns sentence strings,
    # not offsets, so map each sentence back to its character span.
    offsets, cursor = [], 0
    for sent in model.split(text):
        start = text.find(sent, cursor)
        cursor = start + len(sent)
        offsets.append((start, cursor))
    return torch.tensor(offsets)

Add to pyproject.toml:

[project.optional-dependencies]
spacy = ["spacy>=3.0"]
stanza = ["stanza>=1.11"]
wtpsplit-ort-gpu = ["wtpsplit[ort-gpu]"]
wtpsplit-ort-cpu = ["wtpsplit[ort-cpu]"]
wtpsplit = ["wtpsplit"]
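Since each backend is an optional extra, users opt in per backend, e.g. (assuming the package is published as `afterthoughts`):

```shell
# Install only the backend you need
pip install "afterthoughts[spacy]"
pip install "afterthoughts[wtpsplit-ort-cpu]"
```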

Update sent_tokenizer parameter to accept: "blingfire", "nltk", "pysbd", "syntok", "spacy", "stanza", "wtpsplit"
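A name-to-function dispatch for the extended `sent_tokenizer` choices could be sketched as follows (the function and registry names here are hypothetical, not from the codebase):

```python
from typing import Callable

def resolve_sent_tokenizer(
    name: str, registry: dict[str, Callable]
) -> Callable:
    """Look up the offset function registered under `name`, raising a
    ValueError that lists the supported choices on an unknown name."""
    try:
        return registry[name]
    except KeyError:
        choices = ", ".join(sorted(registry))
        raise ValueError(
            f"unknown sent_tokenizer {name!r}; expected one of: {choices}"
        ) from None
```

The real registry would map "blingfire", "nltk", "pysbd", "syntok", "spacy", "stanza", and "wtpsplit" to their `get_sentence_offsets_*` functions, so each backend is only imported when selected.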

Tradeoffs

  • spaCy: Requires model downloads but provides robust sentence detection with broader NLP capabilities
  • Stanza: Heavy footprint but state-of-the-art accuracy for some languages
  • wtpsplit: Best accuracy but significant performance penalty (acceptable for some use cases)

These are all optional dependencies, so users only install what they need. Having actively maintained alternatives reduces long-term maintenance risk and gives users flexibility to choose based on their specific accuracy/speed requirements.

Labels

enhancement (New feature or request)