Add token-based chunking and parameter refactoring (v0.1.1) #8
Merged
Conversation
Add a new `max_chunk_tokens` parameter to `Encoder.encode()` that builds chunks by accumulating sentences until the token limit is hit, respecting sentence boundaries. This addresses issue #6, where chunks exceeding `max_length` were hard-truncated.

Key changes:
- Add `get_chunk_idx_by_tokens()` function in chunk.py for greedy sentence accumulation with a token limit
- Support combining `num_sents` and `max_chunk_tokens`: whichever limit is hit first stops the chunk
- Update validation to handle `num_sents=None` when `max_chunk_tokens` is specified
- Add comprehensive unit and integration tests

Usage:
- `max_chunk_tokens` alone: greedy accumulation, no sentence limit
- `num_sents` alone: fixed sentence count per chunk (existing behavior)
- Both: "at most N sentences AND at most M tokens"
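A minimal sketch of the greedy accumulation described above. The function name `get_chunk_idx_by_tokens` comes from the commit, but the signature and the way per-sentence token counts are supplied here are assumptions, not the library's actual implementation:

```python
def get_chunk_idx_by_tokens(sent_token_counts, max_chunk_tokens, num_sents=None):
    """Assign each sentence a chunk index by greedy accumulation (sketch only).

    Assumes `sent_token_counts` is a list of per-sentence token counts.
    A new chunk starts as soon as adding the next sentence would exceed
    `max_chunk_tokens`, or the optional `num_sents` cap, whichever comes first.
    """
    chunk_idx = []
    current_idx, current_tokens, current_sents = 0, 0, 0
    for n_tokens in sent_token_counts:
        over_tokens = current_tokens + n_tokens > max_chunk_tokens
        over_sents = num_sents is not None and current_sents + 1 > num_sents
        if current_sents > 0 and (over_tokens or over_sents):
            current_idx += 1
            current_tokens, current_sents = 0, 0
        chunk_idx.append(current_idx)
        current_tokens += n_tokens
        current_sents += 1
    return chunk_idx


# get_chunk_idx_by_tokens([10, 20, 40, 5], max_chunk_tokens=50) -> [0, 0, 1, 1]
```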
- Add `split_long_sentences: bool = True` parameter to control whether sentences exceeding `max_chunk_tokens` are split into multiple chunks (enforcing the limit) or kept intact (potentially exceeding the limit)
- Require `chunk_overlap` to be an integer when using `max_chunk_tokens` (overlap is in sentence units, so a float doesn't make sense)
- Add `is_split` tracking to identify split-sentence chunks throughout the pipeline for proper text reconstruction via detokenization
- Fix deduplication to preserve split chunks by including `chunk_idx` in the deduplication key for split chunks (they share sentence boundaries but represent different token ranges)
- Update validation to enforce integer overlap with `max_chunk_tokens`
- Add tests for `split_long_sentences=True/False` behavior and float overlap rejection
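A sketch of the deduplication key described in the last bullets; `is_split` and `chunk_idx` follow the commit, while the other field names (`sequence_idx`, `start_sent`, `end_sent`) are illustrative, not the pipeline's actual schema:

```python
def dedup_key(chunk: dict) -> tuple:
    """Build a deduplication key for a chunk record (sketch only).

    Regular chunks dedupe on their sentence span alone; split-sentence chunks
    also include chunk_idx, because they share sentence boundaries while
    covering different token ranges.
    """
    key = (chunk["sequence_idx"], chunk["start_sent"], chunk["end_sent"])
    if chunk["is_split"]:
        key += (chunk["chunk_idx"],)
    return key
```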
Add validation that fails fast when `max_chunk_tokens` > `max_length`, since chunks are built from token embeddings within sequences that are themselves limited to `max_length` tokens. This prevents configuration errors where users expect larger chunks than are possible.
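A minimal sketch of that fail-fast check; the helper name is hypothetical:

```python
def validate_max_chunk_tokens(max_chunk_tokens: int, max_length: int) -> None:
    # A chunk can never contain more token embeddings than the model's
    # sequence limit allows, so reject impossible configurations up front.
    if max_chunk_tokens > max_length:
        raise ValueError(
            f"max_chunk_tokens ({max_chunk_tokens}) cannot exceed the model's "
            f"max_length ({max_length})."
        )
```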
- `num_sents` → `max_chunk_sents` (parameter for max sentences per chunk)
- `split_long_sentences` → `split_long_sents` (shorter, consistent)
- `batch_tokens` → `max_batch_tokens` (matches the `max_chunk_*` convention)
- Output column remains `num_sents` (actual count, not the limit)
…efactoring

- Add sequence index to warnings when sentences exceed `max_chunk_tokens`
- Rename `chunk_overlap` → `chunk_overlap_sents` in the `encode()` signature
- Rename `prechunk_overlap` → `prechunk_overlap_tokens` in the `encode()` signature
- Update validation to support lists for both `max_chunk_sents` and `max_chunk_tokens`
- Add `validate_chunk_config_pair` for validating (sents, tokens) pairs
- Allow `None` in `max_chunk_sents` lists when `max_chunk_tokens` is specified
- Simplify `chunk_overlap_sents` to int only (removed float/list/dict support)
- Create a test notebook for the `max_chunk_tokens` feature

WIP: Full parameter rename and cartesian product support in progress
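A rough sketch of what `validate_chunk_config_pair` might check for a single (sents, tokens) pair; the exact rules and error messages are assumptions:

```python
def validate_chunk_config_pair(max_chunk_sents, max_chunk_tokens):
    """Validate one (sents, tokens) chunking configuration pair (sketch only)."""
    if max_chunk_sents is None and max_chunk_tokens is None:
        # None is only allowed for the sentence limit when a token limit is given.
        raise ValueError("Specify max_chunk_sents, max_chunk_tokens, or both.")
    for name, value in (("max_chunk_sents", max_chunk_sents),
                        ("max_chunk_tokens", max_chunk_tokens)):
        if value is not None and (not isinstance(value, int) or value < 1):
            raise ValueError(f"{name} must be a positive int or None, got {value!r}.")
```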
Replace the cartesian product with aligned lists for experimenting with multiple chunk configurations. When both `max_chunk_sents` and `max_chunk_tokens` are lists, they must have the same length and are processed as aligned pairs (not a cartesian product).

Core changes:
- Replace the nested loop with `zip()` for aligned pair generation
- Add list length validation and smart scalar broadcasting
- Add `max_chunk_sents` and `max_chunk_tokens` columns to the output DataFrame
- Update deduplication to include the config in the compound key

Parameter renaming (completed):
- `chunk_overlap` → `chunk_overlap_sents` (now int-only)
- `prechunk_overlap` → `prechunk_overlap_tokens`

Examples:
- `max_chunk_sents=[1,2,3]`, `max_chunk_tokens=[64,128,256]` → 3 configs
- `max_chunk_sents=[1,2,3]`, `max_chunk_tokens=64` → broadcasts to 3 configs
- Mismatched list lengths raise a clear `ValueError`

Documentation:
- Updated README with an advanced chunking strategies section
- Added aligned-lists examples and config filtering
- Updated notebook with the new test and proper Polars casting
- All parameter names updated throughout

Tests: All 96 tests passing (31 validation + 65 encode)
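A sketch of the aligned-pair generation with scalar broadcasting described above; the helper name `build_chunk_configs` is hypothetical:

```python
def build_chunk_configs(max_chunk_sents, max_chunk_tokens):
    """Turn the two parameters into aligned (sents, tokens) pairs (sketch only).

    Two lists must have equal length and are zipped; a scalar (or None) on one
    side is broadcast to the length of the list on the other side.
    """
    sents = max_chunk_sents if isinstance(max_chunk_sents, list) else None
    tokens = max_chunk_tokens if isinstance(max_chunk_tokens, list) else None

    if sents is not None and tokens is not None:
        if len(sents) != len(tokens):
            raise ValueError(
                f"max_chunk_sents has {len(sents)} entries but max_chunk_tokens "
                f"has {len(tokens)}; aligned lists must have the same length."
            )
    elif sents is not None:
        tokens = [max_chunk_tokens] * len(sents)   # broadcast scalar/None
    elif tokens is not None:
        sents = [max_chunk_sents] * len(tokens)    # broadcast scalar/None
    else:
        sents, tokens = [max_chunk_sents], [max_chunk_tokens]

    return list(zip(sents, tokens))


# build_chunk_configs([1, 2, 3], 64) -> [(1, 64), (2, 64), (3, 64)]
```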
- Fix Object dtype casting error by converting sentinel values to a list instead of using `np.where`
- Clarify that `split_long_sents` only controls chunking at the `max_chunk_tokens` boundary
- Document automatic handling of sentences exceeding the model's `max_length` during prechunking
- Add a global chunk index `idx` as the first column in both Polars and pandas DataFrames
- Maps directly to embedding array rows for convenient filtering and indexing
- Polars: use `with_row_index(name="idx")`
- Pandas: use `insert(0, "idx", range(len(df)))`
- Update documentation to show `idx` column usage
The `idx` column was being added before the DataFrame was sorted by `sequence_idx`, causing `idx` values to be shuffled along with the rows. As a result, `idx` did not start at 0 for document 0.

Fix: add the `idx` column after all sorting is complete, ensuring it always starts at 0 and correctly maps to the final embedding array order.
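A minimal illustration of the corrected ordering (sort first, then add `idx`), using standalone Polars and pandas frames rather than the library's internal pipeline:

```python
import pandas as pd
import polars as pl

df_pl = pl.DataFrame({"sequence_idx": [1, 0], "text": ["doc 1 chunk", "doc 0 chunk"]})
df_pd = pd.DataFrame({"sequence_idx": [1, 0], "text": ["doc 1 chunk", "doc 0 chunk"]})

# Sort first, then add idx, so that idx 0 is row 0 of the final embedding array.
df_pl = df_pl.sort("sequence_idx").with_row_index(name="idx")

df_pd = df_pd.sort_values("sequence_idx").reset_index(drop=True)
df_pd.insert(0, "idx", range(len(df_pd)))
```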
- Add CI workflow with lint, typecheck, and test jobs across Python 3.10-3.12
- Fix `encode()` overload signatures to match the implementation (parameter renames)
- Update tests to use the new parameter names (`chunk_overlap_sents`, `prechunk_overlap_tokens`)
- Add accelerate to dev dependencies (required by transformers 5.0)
- Disable reportMissingImports in pyright for optional dependencies
- Rename the main class from `Encoder` to `LateEncoder` to better convey that it implements late chunking (embed first, chunk second)
- Add a deprecated `Encoder` alias with a DeprecationWarning for backwards compatibility
- Update all docstrings, examples, and tests to use `LateEncoder`
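One conventional way to provide such a deprecated alias; a sketch, not necessarily how the package implements it:

```python
import warnings


class LateEncoder:
    """Late-chunking encoder: embed first, chunk second."""

    def __init__(self, *args, **kwargs):
        ...


class Encoder(LateEncoder):
    """Deprecated alias for LateEncoder, kept for backwards compatibility."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Encoder is deprecated; use LateEncoder instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```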
- Tighten the README tagline to emphasize sentence-aware embeddings
- Only include the `max_chunk_sents`/`max_chunk_tokens` columns when the parameter is specified
- Update docs to explain token-based chunking and deduplication behavior
- Bump version to 0.1.1
- Remove `chunk_overlap_sents` from README examples to downplay the feature
- Fix docstring parameter names (`chunk_overlap` → `chunk_overlap_sents`, `prechunk_overlap` → `prechunk_overlap_tokens`)
- Remove the outdated description of float/list/dict support for overlap
- Remove the example using `chunk_overlap=0.5` from the class docstring
- Update `__init__.py` to mention token-based chunking in key features
- Fix encode.py `_generate_chunk_embeds` to clarify that `chunk_overlap_sents` is int only
- Improve tokenize.py `prechunk_overlap_tokens` description (float vs int)
- Clarify that DataFrame output columns are conditional on the parameters specified
Summary
This PR introduces token-based chunking alongside the existing sentence-based approach, adds better control over long sentences, and renames several parameters for consistency.
Breaking Changes
- `Encoder` → `LateEncoder` (deprecated alias provided for backward compatibility with warning)
- `num_sents` → `max_chunk_sents`
- `chunk_overlap` → `chunk_overlap_sents`
- `prechunk_overlap` → `prechunk_overlap_tokens`

New Features
- `max_chunk_tokens` parameter: token-based chunking that greedily accumulates sentences until the token limit is reached
- `split_long_sents` parameter: control how sentences exceeding `max_chunk_tokens` are handled (split or kept intact)
- `idx` column: added to DataFrames for direct embedding indexing (0-based)
- Aligned lists of `max_chunk_sents` and `max_chunk_tokens` for multi-configuration experiments
- `max_chunk_sents` and `max_chunk_tokens` columns in the DataFrame (only included when specified)
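Putting the new surface together, a hypothetical usage sketch (the import path, constructor argument, model name, and return handling are assumptions; the parameter names follow this PR):

```python
from late_encoder import LateEncoder  # import path is an assumption

encoder = LateEncoder("sentence-transformers/all-MiniLM-L6-v2")  # example model

# At most 5 sentences AND at most 128 tokens per chunk; whichever limit is
# hit first closes the chunk. Long sentences are split to enforce the limit.
result = encoder.encode(
    ["First document. It has several sentences.", "Second document."],
    max_chunk_sents=5,
    max_chunk_tokens=128,
    split_long_sents=True,
)

# The 0-based idx column in the returned DataFrame maps rows directly to
# rows of the chunk embedding array.
```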
Improvements

- Fix the `idx` column to start at 0 (previously started at 1 after sorting)
- Raise a `ValueError` if `max_chunk_tokens` exceeds the model's `max_length`

Documentation

- README and docstrings updated for token-based chunking and the parameter renames
Test plan
- Deprecated `Encoder` alias
- New chunking parameters (`max_chunk_sents`, `max_chunk_tokens`, aligned lists)
- `split_long_sents` behavior (True/False)