Add token-based chunking and parameter refactoring (v0.1.1) #8
Merged
Conversation
Add a new `max_chunk_tokens` parameter to `Encoder.encode()` that builds chunks by accumulating sentences until the token limit is hit, respecting sentence boundaries. This addresses issue #6, where chunks exceeding `max_length` were hard-truncated.

Key changes:
- Add `get_chunk_idx_by_tokens()` function in chunk.py for greedy sentence accumulation with a token limit
- Support combining `num_sents` and `max_chunk_tokens`: whichever limit is hit first stops the chunk
- Update validation to handle `num_sents=None` when `max_chunk_tokens` is specified
- Add comprehensive unit and integration tests

Usage:
- `max_chunk_tokens` alone: greedy accumulation, no sentence limit
- `num_sents` alone: fixed sentence count per chunk (existing behavior)
- Both: "at most N sentences AND at most M tokens"
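A minimal sketch of the greedy accumulation described above. The function name `get_chunk_idx_by_tokens` comes from the commit, but the signature and the way per-sentence token counts are supplied here are assumptions, not the library's actual implementation:

```python
def get_chunk_idx_by_tokens(sent_token_counts, max_chunk_tokens, num_sents=None):
    """Assign each sentence a chunk index by greedy accumulation (sketch only).

    Assumes `sent_token_counts` is a list of per-sentence token counts.
    A new chunk starts as soon as adding the next sentence would exceed
    `max_chunk_tokens`, or the optional `num_sents` cap, whichever comes first.
    """
    chunk_idx = []
    current_idx, current_tokens, current_sents = 0, 0, 0
    for n_tokens in sent_token_counts:
        over_tokens = current_tokens + n_tokens > max_chunk_tokens
        over_sents = num_sents is not None and current_sents + 1 > num_sents
        if current_sents > 0 and (over_tokens or over_sents):
            current_idx += 1
            current_tokens, current_sents = 0, 0
        chunk_idx.append(current_idx)
        current_tokens += n_tokens
        current_sents += 1
    return chunk_idx


# get_chunk_idx_by_tokens([10, 20, 40, 5], max_chunk_tokens=50) -> [0, 0, 1, 1]
```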
- Add `split_long_sentences: bool = True` parameter to control whether sentences exceeding `max_chunk_tokens` are split into multiple chunks (enforcing the limit) or kept intact (potentially exceeding the limit)
- Require `chunk_overlap` to be an integer when using `max_chunk_tokens` (overlap is in sentence units, so a float doesn't make sense)
- Add `is_split` tracking to identify split-sentence chunks throughout the pipeline for proper text reconstruction via detokenization
- Fix deduplication to preserve split chunks by including `chunk_idx` in the deduplication key for split chunks (they share sentence boundaries but represent different token ranges)
- Update validation to enforce integer overlap with `max_chunk_tokens`
- Add tests for `split_long_sentences=True/False` behavior and float overlap rejection
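A sketch of the deduplication key described in the last bullets; `is_split` and `chunk_idx` follow the commit, while the other field names (`sequence_idx`, `start_sent`, `end_sent`) are illustrative, not the pipeline's actual schema:

```python
def dedup_key(chunk: dict) -> tuple:
    """Build a deduplication key for a chunk record (sketch only).

    Regular chunks dedupe on their sentence span alone; split-sentence chunks
    also include chunk_idx, because they share sentence boundaries while
    covering different token ranges.
    """
    key = (chunk["sequence_idx"], chunk["start_sent"], chunk["end_sent"])
    if chunk["is_split"]:
        key += (chunk["chunk_idx"],)
    return key
```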
Add validation that fails fast when `max_chunk_tokens` > `max_length`, since chunks are built from token embeddings within sequences that are themselves limited to `max_length` tokens. This prevents configuration errors where users expect larger chunks than are possible.
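A minimal sketch of that fail-fast check; the helper name is hypothetical:

```python
def validate_max_chunk_tokens(max_chunk_tokens: int, max_length: int) -> None:
    # A chunk can never contain more token embeddings than the model's
    # sequence limit allows, so reject impossible configurations up front.
    if max_chunk_tokens > max_length:
        raise ValueError(
            f"max_chunk_tokens ({max_chunk_tokens}) cannot exceed the model's "
            f"max_length ({max_length})."
        )
```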
- `num_sents` → `max_chunk_sents` (parameter for max sentences per chunk)
- `split_long_sentences` → `split_long_sents` (shorter, consistent)
- `batch_tokens` → `max_batch_tokens` (matches the `max_chunk_*` convention)
- Output column remains `num_sents` (actual count, not the limit)
…efactoring

- Add sequence index to warnings when sentences exceed `max_chunk_tokens`
- Rename `chunk_overlap` → `chunk_overlap_sents` in the `encode()` signature
- Rename `prechunk_overlap` → `prechunk_overlap_tokens` in the `encode()` signature
- Update validation to support lists for both `max_chunk_sents` and `max_chunk_tokens`
- Add `validate_chunk_config_pair` for validating (sents, tokens) pairs
- Allow `None` in `max_chunk_sents` lists when `max_chunk_tokens` is specified
- Simplify `chunk_overlap_sents` to int only (removed float/list/dict support)
- Create a test notebook for the `max_chunk_tokens` feature

WIP: Full parameter rename and cartesian product support in progress
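A rough sketch of what `validate_chunk_config_pair` might check for a single (sents, tokens) pair; the exact rules and error messages are assumptions:

```python
def validate_chunk_config_pair(max_chunk_sents, max_chunk_tokens):
    """Validate one (sents, tokens) chunking configuration pair (sketch only)."""
    if max_chunk_sents is None and max_chunk_tokens is None:
        # None is only allowed for the sentence limit when a token limit is given.
        raise ValueError("Specify max_chunk_sents, max_chunk_tokens, or both.")
    for name, value in (("max_chunk_sents", max_chunk_sents),
                        ("max_chunk_tokens", max_chunk_tokens)):
        if value is not None and (not isinstance(value, int) or value < 1):
            raise ValueError(f"{name} must be a positive int or None, got {value!r}.")
```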
Replace the cartesian product with aligned lists for experimenting with multiple chunk configurations. When both `max_chunk_sents` and `max_chunk_tokens` are lists, they must have the same length and are processed as aligned pairs (not a cartesian product).

Core changes:
- Replace the nested loop with `zip()` for aligned pair generation
- Add list length validation and smart scalar broadcasting
- Add `max_chunk_sents` and `max_chunk_tokens` columns to the output DataFrame
- Update deduplication to include the config in the compound key

Parameter renaming (completed):
- `chunk_overlap` → `chunk_overlap_sents` (now int-only)
- `prechunk_overlap` → `prechunk_overlap_tokens`

Examples:
- `max_chunk_sents=[1,2,3]`, `max_chunk_tokens=[64,128,256]` → 3 configs
- `max_chunk_sents=[1,2,3]`, `max_chunk_tokens=64` → broadcasts to 3 configs
- Mismatched list lengths raise a clear `ValueError`

Documentation:
- Updated README with an advanced chunking strategies section
- Added aligned-lists examples and config filtering
- Updated notebook with the new test and proper Polars casting
- All parameter names updated throughout

Tests: All 96 tests passing (31 validation + 65 encode)
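A sketch of the aligned-pair generation with scalar broadcasting described above; the helper name `build_chunk_configs` is hypothetical:

```python
def build_chunk_configs(max_chunk_sents, max_chunk_tokens):
    """Turn the two parameters into aligned (sents, tokens) pairs (sketch only).

    Two lists must have equal length and are zipped; a scalar (or None) on one
    side is broadcast to the length of the list on the other side.
    """
    sents = max_chunk_sents if isinstance(max_chunk_sents, list) else None
    tokens = max_chunk_tokens if isinstance(max_chunk_tokens, list) else None

    if sents is not None and tokens is not None:
        if len(sents) != len(tokens):
            raise ValueError(
                f"max_chunk_sents has {len(sents)} entries but max_chunk_tokens "
                f"has {len(tokens)}; aligned lists must have the same length."
            )
    elif sents is not None:
        tokens = [max_chunk_tokens] * len(sents)   # broadcast scalar/None
    elif tokens is not None:
        sents = [max_chunk_sents] * len(tokens)    # broadcast scalar/None
    else:
        sents, tokens = [max_chunk_sents], [max_chunk_tokens]

    return list(zip(sents, tokens))


# build_chunk_configs([1, 2, 3], 64) -> [(1, 64), (2, 64), (3, 64)]
```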
- Fix Object dtype casting error by converting sentinel values to a list instead of using `np.where`
- Clarify that `split_long_sents` only controls chunking at the `max_chunk_tokens` boundary
- Document automatic handling of sentences exceeding the model's `max_length` during prechunking
- Add a global chunk index `idx` as the first column in both Polars and pandas DataFrames
- Maps directly to embedding array rows for convenient filtering and indexing
- Polars: use `with_row_index(name="idx")`
- Pandas: use `insert(0, "idx", range(len(df)))`
- Update documentation to show `idx` column usage
The `idx` column was being added before the DataFrame was sorted by `sequence_idx`, causing `idx` values to be shuffled along with the rows. As a result, `idx` did not start at 0 for document 0.

Fix: add the `idx` column after all sorting is complete, ensuring it always starts at 0 and correctly maps to the final embedding array order.
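A minimal illustration of the corrected ordering (sort first, then add `idx`), using standalone Polars and pandas frames rather than the library's internal pipeline:

```python
import pandas as pd
import polars as pl

df_pl = pl.DataFrame({"sequence_idx": [1, 0], "text": ["doc 1 chunk", "doc 0 chunk"]})
df_pd = pd.DataFrame({"sequence_idx": [1, 0], "text": ["doc 1 chunk", "doc 0 chunk"]})

# Sort first, then add idx, so that idx 0 is row 0 of the final embedding array.
df_pl = df_pl.sort("sequence_idx").with_row_index(name="idx")

df_pd = df_pd.sort_values("sequence_idx").reset_index(drop=True)
df_pd.insert(0, "idx", range(len(df_pd)))
```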
- Add CI workflow with lint, typecheck, and test jobs across Python 3.10-3.12
- Fix `encode()` overload signatures to match the implementation (parameter renames)
- Update tests to use the new parameter names (`chunk_overlap_sents`, `prechunk_overlap_tokens`)
- Add accelerate to dev dependencies (required by transformers 5.0)
- Disable reportMissingImports in pyright for optional dependencies
- Rename the main class from `Encoder` to `LateEncoder` to better convey that it implements late chunking (embed first, chunk second)
- Add a deprecated `Encoder` alias with a DeprecationWarning for backwards compatibility
- Update all docstrings, examples, and tests to use `LateEncoder`
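One conventional way to provide such a deprecated alias; a sketch, not necessarily how the package implements it:

```python
import warnings


class LateEncoder:
    """Late-chunking encoder: embed first, chunk second."""

    def __init__(self, *args, **kwargs):
        ...


class Encoder(LateEncoder):
    """Deprecated alias for LateEncoder, kept for backwards compatibility."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Encoder is deprecated; use LateEncoder instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```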
- Tighten the README tagline to emphasize sentence-aware embeddings
- Only include the `max_chunk_sents`/`max_chunk_tokens` columns when the parameter is specified
- Update docs to explain token-based chunking and deduplication behavior
- Bump version to 0.1.1
- Remove `chunk_overlap_sents` from README examples to downplay the feature
- Fix docstring parameter names (`chunk_overlap` → `chunk_overlap_sents`, `prechunk_overlap` → `prechunk_overlap_tokens`)
- Remove the outdated description of float/list/dict support for overlap
- Remove the example using `chunk_overlap=0.5` from the class docstring
- Update `__init__.py` to mention token-based chunking in key features
- Fix encode.py `_generate_chunk_embeds` to clarify that `chunk_overlap_sents` is int only
- Improve tokenize.py `prechunk_overlap_tokens` description (float vs int)
- Clarify that DataFrame output columns are conditional on the parameters specified
Summary
This PR introduces token-based chunking alongside the existing sentence-based approach, adds better control over long sentences, and renames several parameters for consistency.
Breaking Changes
- `Encoder` → `LateEncoder` (deprecated alias provided for backward compatibility with warning)
- `num_sents` → `max_chunk_sents`
- `chunk_overlap` → `chunk_overlap_sents`
- `prechunk_overlap` → `prechunk_overlap_tokens`

New Features
- `max_chunk_tokens` parameter: token-based chunking that greedily accumulates sentences until the token limit is reached
- `split_long_sents` parameter: control how sentences exceeding `max_chunk_tokens` are handled (split or kept intact)
- `idx` column: added to DataFrames for direct embedding indexing (0-based)
- Aligned lists of `max_chunk_sents` and `max_chunk_tokens` for multi-configuration experiments
- `max_chunk_sents` and `max_chunk_tokens` columns in the DataFrame (only included when specified)
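Putting the new surface together, a hypothetical usage sketch (the import path, constructor argument, model name, and return handling are assumptions; the parameter names follow this PR):

```python
from late_encoder import LateEncoder  # import path is an assumption

encoder = LateEncoder("sentence-transformers/all-MiniLM-L6-v2")  # example model

# At most 5 sentences AND at most 128 tokens per chunk; whichever limit is
# hit first closes the chunk. Long sentences are split to enforce the limit.
result = encoder.encode(
    ["First document. It has several sentences.", "Second document."],
    max_chunk_sents=5,
    max_chunk_tokens=128,
    split_long_sents=True,
)

# The 0-based idx column in the returned DataFrame maps rows directly to
# rows of the chunk embedding array.
```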
Improvements

- Fix the `idx` column to start at 0 (previously started at 1 after sorting)
- Raise a `ValueError` if `max_chunk_tokens` exceeds the model's `max_length`

Documentation

- README and docstrings updated for token-based chunking and the parameter renames
Test plan
- Deprecated `Encoder` alias
- New chunking parameters (`max_chunk_sents`, `max_chunk_tokens`, aligned lists)
- `split_long_sents` behavior (True/False)