
Add token-based chunking and parameter refactoring (v0.1.1) #8

Merged
429er merged 20 commits into main from feature/max-chunk-tokens on Jan 27, 2026
Conversation

@429er 429er commented Jan 27, 2026

Summary

This PR introduces token-based chunking alongside the existing sentence-based approach, provides better control over long sentences, and includes parameter renaming for consistency.

Breaking Changes

  • Renamed Encoder → LateEncoder (a deprecated Encoder alias with a DeprecationWarning is kept for backward compatibility; see the alias sketch after this list)
  • Renamed parameters for consistency:
    • num_sents → max_chunk_sents
    • chunk_overlap → chunk_overlap_sents
    • prechunk_overlap → prechunk_overlap_tokens
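
A minimal sketch of the deprecated-alias pattern described above, using the standard warnings module; the class bodies and constructor are stand-ins, and the project's actual implementation may differ in details:

```python
import warnings


class LateEncoder:
    """Late chunking encoder: embed first, chunk second (stub for illustration)."""

    def __init__(self, model_name: str | None = None):
        self.model_name = model_name


class Encoder(LateEncoder):
    """Deprecated alias kept so v0.1.0 code keeps working."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Encoder is deprecated; use LateEncoder instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```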

New Features

  • max_chunk_tokens parameter: Token-based chunking that greedily accumulates sentences until the token limit is reached
  • split_long_sents parameter: Controls whether sentences exceeding max_chunk_tokens are split or kept intact
  • idx column: Added to DataFrames for direct embedding indexing (0-based)
  • Aligned lists: Pair max_chunk_sents and max_chunk_tokens for multi-configuration experiments (see the usage sketch after this list)
  • Configuration tracking: max_chunk_sents and max_chunk_tokens columns in DataFrame (only included when specified)
  • GitHub CI workflow: Automated testing on push and PR
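
A hedged usage sketch tying these features together. The import path, constructor argument, and the (DataFrame, embeddings) return shape are assumptions for illustration only, and a Polars DataFrame is assumed; consult the README for the actual API:

```python
from late_encoder import LateEncoder  # hypothetical import path -- not shown in this PR

docs = [
    "First document. It has a few short sentences. Each fits comfortably.",
    "Second document with one very long sentence that may blow past the token budget.",
]

enc = LateEncoder("nomic-ai/nomic-embed-text-v1")  # model name purely illustrative

# Three aligned configurations: (1 sent, 64 tokens), (2 sents, 128), (3 sents, 256).
df, embeds = enc.encode(  # assuming a (DataFrame, embeddings) return shape
    docs,
    max_chunk_sents=[1, 2, 3],
    max_chunk_tokens=[64, 128, 256],
    split_long_sents=True,  # split sentences that exceed the token budget
)

# The 0-based idx column maps each row straight onto the embedding array, and the
# max_chunk_sents / max_chunk_tokens columns appear because the parameters were specified.
small_chunks = df.filter(df["max_chunk_tokens"] == 64)
small_vectors = embeds[small_chunks["idx"].to_list()]
```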

Improvements

  • Updated deduplication to include configuration parameters in compound key
  • Improved warning messages with sequence indices for better debugging
  • Enhanced docstrings for clarity and accuracy
  • Fixed DataFrame idx column to start at 0 (previously started at 1 after sorting)
  • Validation now raises ValueError if max_chunk_tokens exceeds model's max_length
  • Fixed deduplication for split chunks from long sentences
  • Updated README tagline to emphasize sentence-aware embeddings

Documentation

  • Comprehensive documentation of token-based chunking in README
  • Updated examples to showcase new features
  • Improved parameter descriptions in docstrings
  • Updated CHANGELOG.md for v0.1.1

Test plan

  • Pre-commit hooks pass (ruff, pyright, codespell)
  • GitHub CI workflow passes all tests
  • Backward compatibility with deprecated Encoder alias
  • Test with different chunk configurations (max_chunk_sents, max_chunk_tokens, aligned lists)
  • Verify split_long_sents behavior (True/False)
  • Confirm DataFrame columns are conditional on input parameters

429er added 20 commits January 23, 2026 00:02
Add a new `max_chunk_tokens` parameter to `Encoder.encode()` that builds
chunks by accumulating sentences until hitting the token limit, respecting
sentence boundaries. This addresses issue #6 where chunks exceeding
max_length were hard-truncated.

Key changes:
- Add `get_chunk_idx_by_tokens()` function in chunk.py for greedy sentence
  accumulation with token limit
- Support combining `num_sents` and `max_chunk_tokens` - whichever limit
  is hit first stops the chunk
- Update validation to handle `num_sents=None` when `max_chunk_tokens` is
  specified
- Add comprehensive unit and integration tests

Usage:
- max_chunk_tokens alone: Greedy accumulation, no sentence limit
- num_sents alone: Fixed sentence count per chunk (existing behavior)
- Both: "At most N sentences AND at most M tokens"
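
An illustrative re-implementation of the greedy accumulation idea behind `get_chunk_idx_by_tokens()`; the name, signature, and return type here are assumptions, and long sentences are kept intact (as with splitting disabled):

```python
def greedy_chunk_indices(
    sent_token_counts: list[int],
    max_chunk_tokens: int,
    max_chunk_sents: int | None = None,
) -> list[tuple[int, int]]:
    """Group sentences into chunks, starting a new chunk when either limit would be exceeded."""
    chunks: list[tuple[int, int]] = []
    start, tokens = 0, 0
    for i, n in enumerate(sent_token_counts):
        over_tokens = tokens + n > max_chunk_tokens and tokens > 0
        over_sents = max_chunk_sents is not None and (i - start) >= max_chunk_sents
        if over_tokens or over_sents:
            chunks.append((start, i))  # [start, i) = one chunk of whole sentences
            start, tokens = i, 0
        tokens += n
    if start < len(sent_token_counts):
        chunks.append((start, len(sent_token_counts)))
    return chunks


# Sentence lengths in tokens, 64-token budget, at most 3 sentences per chunk.
print(greedy_chunk_indices([20, 30, 25, 10, 80, 5], max_chunk_tokens=64, max_chunk_sents=3))
# -> [(0, 2), (2, 4), (4, 5), (5, 6)]  (the 80-token sentence stays intact here)
```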
- Add `split_long_sentences: bool = True` parameter to control whether
  sentences exceeding max_chunk_tokens are split into multiple chunks
  (enforcing the limit) or kept intact (potentially exceeding the limit)

- Require `chunk_overlap` to be an integer when using max_chunk_tokens
  (overlap is in sentence units, so float doesn't make sense)

- Add `is_split` tracking to identify split sentence chunks throughout
  the pipeline for proper text reconstruction via detokenization

- Fix deduplication to preserve split chunks by including chunk_idx
  in the deduplication key for split chunks (they share sentence
  boundaries but represent different token ranges)

- Update validation to enforce integer overlap with max_chunk_tokens

- Add tests for split_long_sentences=True/False behavior and
  float overlap rejection
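
One way to express the deduplication fix above: add chunk_idx to the compound key only for split chunks, so chunks that share sentence boundaries but cover different token ranges all survive. The field names below are assumptions, not the real column names:

```python
def dedup_key(row: dict) -> tuple:
    """Compound deduplication key for a chunk row (illustrative field names)."""
    key = (row["sequence_idx"], row["sent_start"], row["sent_end"])
    if row.get("is_split"):
        # Split chunks share sentence boundaries but represent different
        # token ranges, so chunk_idx keeps each split part as its own row.
        key += (row["chunk_idx"],)
    return key


rows = [
    {"sequence_idx": 0, "sent_start": 4, "sent_end": 5, "is_split": True, "chunk_idx": 2},
    {"sequence_idx": 0, "sent_start": 4, "sent_end": 5, "is_split": True, "chunk_idx": 3},
]
assert len({dedup_key(r) for r in rows}) == 2  # both split parts are kept
```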
Add validation that fails fast when max_chunk_tokens > max_length,
since chunks are built from token embeddings within sequences that
are limited to max_length tokens. This prevents configuration errors
where users expect larger chunks than possible.
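
A minimal sketch of that fail-fast check; the helper name and where the check actually lives in the codebase are assumptions:

```python
def check_max_chunk_tokens(max_chunk_tokens: int, max_length: int) -> None:
    """Reject configurations whose chunks could never fit in one encoded sequence."""
    if max_chunk_tokens > max_length:
        raise ValueError(
            f"max_chunk_tokens ({max_chunk_tokens}) exceeds the model's "
            f"max_length ({max_length}); chunks are built from token embeddings "
            f"inside sequences capped at max_length tokens."
        )
```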
- num_sents -> max_chunk_sents (parameter for max sentences per chunk)
- split_long_sentences -> split_long_sents (shorter, consistent)
- batch_tokens -> max_batch_tokens (matches max_chunk_* convention)
- Output column remains num_sents (actual count, not the limit)
…efactoring

- Add sequence index to warnings when sentences exceed max_chunk_tokens
- Rename chunk_overlap -> chunk_overlap_sents in encode() signature
- Rename prechunk_overlap -> prechunk_overlap_tokens in encode() signature
- Update validation to support lists for both max_chunk_sents and max_chunk_tokens
- Add validate_chunk_config_pair for validating (sents, tokens) pairs
- Allow None in max_chunk_sents lists when max_chunk_tokens is specified
- Simplify chunk_overlap_sents to int only (removed float/list/dict support)
- Create test notebook for max_chunk_tokens feature

WIP: Full parameter rename and cartesian product support in progress
Replace cartesian product with aligned lists for experimenting with
multiple chunk configurations. When both max_chunk_sents and
max_chunk_tokens are lists, they must have the same length and are
processed as aligned pairs (not cartesian product).

Core changes:
- Replace nested loop with zip() for aligned pair generation
- Add list length validation and smart scalar broadcasting
- Add max_chunk_sents and max_chunk_tokens columns to output DataFrame
- Update deduplication to include config in compound key

Parameter renaming (completed):
- chunk_overlap → chunk_overlap_sents (now int-only)
- prechunk_overlap → prechunk_overlap_tokens

Examples:
- max_chunk_sents=[1,2,3], max_chunk_tokens=[64,128,256] → 3 configs
- max_chunk_sents=[1,2,3], max_chunk_tokens=64 → broadcasts to 3 configs
- Mismatched list lengths raise clear ValueError
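
A sketch of the aligned-pair normalization and scalar broadcasting described above; the function name and exact error message are illustrative:

```python
def aligned_configs(
    max_chunk_sents: int | None | list[int | None],
    max_chunk_tokens: int | None | list[int | None],
) -> list[tuple[int | None, int | None]]:
    """Normalize the two parameters into aligned (sents, tokens) pairs."""
    sents = max_chunk_sents if isinstance(max_chunk_sents, list) else [max_chunk_sents]
    tokens = max_chunk_tokens if isinstance(max_chunk_tokens, list) else [max_chunk_tokens]
    # Broadcast a scalar to the length of the other list.
    if len(sents) == 1 and len(tokens) > 1:
        sents = sents * len(tokens)
    if len(tokens) == 1 and len(sents) > 1:
        tokens = tokens * len(sents)
    if len(sents) != len(tokens):
        raise ValueError(
            f"max_chunk_sents and max_chunk_tokens lists must have the same "
            f"length, got {len(sents)} and {len(tokens)}"
        )
    return list(zip(sents, tokens))


print(aligned_configs([1, 2, 3], [64, 128, 256]))  # [(1, 64), (2, 128), (3, 256)]
print(aligned_configs([1, 2, 3], 64))              # [(1, 64), (2, 64), (3, 64)]
```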

Documentation:
- Updated README with advanced chunking strategies section
- Added aligned lists examples and config filtering
- Updated notebook with new test and proper Polars casting
- All parameter names updated throughout

Tests: All 96 tests passing (31 validation + 65 encode)
- Fix Object dtype casting error by converting sentinel values to list instead of using np.where
- Clarify that split_long_sents only controls chunking at max_chunk_tokens boundary
- Document automatic handling of sentences exceeding model max_length during prechunking
- Add global chunk index 'idx' as first column in both Polars and pandas DataFrames
- Maps directly to embedding array rows for convenient filtering and indexing
- Polars: use with_row_index(name="idx")
- Pandas: use insert(0, "idx", range(len(df)))
- Update documentation to show idx column usage
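
A toy illustration of the two idx-column calls quoted above; the DataFrames and embedding array are stand-ins for the real encode() outputs:

```python
import numpy as np
import pandas as pd
import polars as pl

# Stand-ins for encode() outputs: one embedding per chunk row.
embeds = np.random.rand(3, 8)
pl_df = pl.DataFrame({"text": ["a", "b", "c"]})
pd_df = pd.DataFrame({"text": ["a", "b", "c"]})

# Polars: prepend a 0-based row index named "idx".
pl_df = pl_df.with_row_index(name="idx")

# Pandas: insert the same 0-based index as the first column.
pd_df.insert(0, "idx", range(len(pd_df)))

# idx maps rows directly onto embedding array rows, e.g. after filtering:
subset = pl_df.filter(pl.col("text") != "b")
subset_vectors = embeds[subset["idx"].to_numpy()]
```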
The idx column was being added before the DataFrame was sorted by sequence_idx,
causing idx values to be shuffled with the rows. This resulted in idx not
starting at 0 for document 0.

Fix: Add idx column after all sorting is complete, ensuring it always starts
at 0 and correctly maps to the final embedding array order.
- Add CI workflow with lint, typecheck, and test jobs across Python 3.10-3.12
- Fix encode() overload signatures to match implementation (parameter renames)
- Update tests to use new parameter names (chunk_overlap_sents, prechunk_overlap_tokens)
- Add accelerate to dev dependencies (required by transformers 5.0)
- Disable reportMissingImports in pyright for optional dependencies
- Rename main class from Encoder to LateEncoder to better convey that
  it implements late chunking (embed first, chunk second)
- Add deprecated Encoder alias with DeprecationWarning for backwards
  compatibility
- Update all docstrings, examples, and tests to use LateEncoder
- Tighten README tagline to emphasize sentence-aware embeddings
- Only include max_chunk_sents/max_chunk_tokens columns when parameter is specified
- Update docs to explain token-based chunking and deduplication behavior
- Bump version to 0.1.1
- Remove chunk_overlap_sents from README examples to downplay feature
- Fix docstring parameter names (chunk_overlap → chunk_overlap_sents,
  prechunk_overlap → prechunk_overlap_tokens)
- Remove outdated description of float/list/dict support for overlap
- Remove example using chunk_overlap=0.5 from class docstring
- Update __init__.py to mention token-based chunking in key features
- Fix encode.py _generate_chunk_embeds to clarify chunk_overlap_sents is int only
- Improve tokenize.py prechunk_overlap_tokens description (float vs int)
- Clarify DataFrame output columns are conditional on parameters specified
@429er 429er merged commit c8a4111 into main Jan 27, 2026
6 checks passed
@429er 429er deleted the feature/max-chunk-tokens branch January 27, 2026 00:58