feat(rag): add chunk_text_with_overlap utility for adaptive text chunking#462
feat(rag): add chunk_text_with_overlap utility for adaptive text chunking#462vipul674 wants to merge 3 commits into
Conversation
…king Adds a pure-text utility that splits input into chunks of roughly chunk_size characters with a sliding chunk_overlap window, snapping each chunk's end to the latest natural boundary (paragraph break, sentence terminal) within a chunk_size // 2 lookahead. The result is boundary-aware chunks that preserve cross-boundary context for the vector store while staying close to the target size. This is purely additive and does not modify any existing call sites; consumers can opt in by wiring the new function into the chunking pipeline behind a feature flag or A/B test before the legacy pass is retired. Refs FireFistisDead#247
|
@vipul674 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds an adaptive text chunker ChangesText Chunking with Overlap Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an adaptive, boundary-aware text chunking helper to the PDF extractor and introduces unit tests to validate chunk sizing, overlap behavior, boundary preference, and input validation.
Changes:
- Added
chunk_text_with_overlap()with paragraph/sentence boundary detection and overlap-based sliding windows. - Introduced default boundary regex patterns and argument validation for chunking.
- Added pytest unit tests covering common and edge-case chunking behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| rag-service/crawler/pdf_extractor.py | Implements boundary-aware chunking with overlap and default boundary patterns. |
| rag-service/test_pdf_extractor_chunking.py | Adds unit tests to verify chunking behavior, overlap, boundary selection, and validation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| window_end = min(naive_end + lookahead_max, text_len) | ||
| # Walk boundary patterns in priority order; the first pattern | ||
| # that yields a match wins, but we always pick the latest | ||
| # match within the lookahead so the chunk stays close to | ||
| # `chunk_size` characters. | ||
| best_offset = -1 | ||
| for pattern in compiled: | ||
| for match in pattern.finditer(text, naive_end, window_end): | ||
| offset = match.end() - start | ||
| if offset > best_offset: | ||
| best_offset = offset | ||
| if best_offset > 0: | ||
| break | ||
| if best_offset > 0: | ||
| chunk_end = start + best_offset |
There was a problem hiding this comment.
Acknowledged. Took the docstring-only fix in commit f1795c0. The implementation is unchanged — the boundary search still extends forward from naive_end, but the docstring now states this explicitly. A boundary at or before naive_end search would change behaviour (smaller chunks, more hard cuts at the start side) and is deferred to a follow-up PR per the maintainer's discretion. 12/12 tests still pass.
There was a problem hiding this comment.
Fixed in commit 16963ab. The Returns section now spells out the lstrip-only behaviour and the rationale (preserving the boundary used to align the chunk end verbatim for the next chunk's overlap window). 10/10 tests still pass.
| def chunk_text_with_overlap( | ||
| text: str, | ||
| *, | ||
| chunk_size: int = 800, | ||
| chunk_overlap: int = 200, | ||
| boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS, | ||
| ) -> list[str]: |
There was a problem hiding this comment.
Acknowledged. Tightening the type annotation to Optional[str] is a public-API change (callers that relied on the implicit empty-check might pass None expecting an error rather than a silent []). Out of scope for the docstring-only round per the maintainer's discretion. Will open a follow-up PR if the maintainer wants to tighten the type. No code change in this PR.
| Returns: | ||
| List of non-empty stripped text chunks. The first chunk starts | ||
| at the beginning of the input; the last chunk absorbs any | ||
| remaining text up to the end of the input. |
There was a problem hiding this comment.
Fixed in commit 16963ab. The Returns section now says lstrip (leading whitespace only) and explains the reason for preserving trailing boundary whitespace verbatim. 10/10 tests still pass.
| # Stripping the end would erase the very signal we used to align | ||
| # the chunk and would also make overlapping tails diverge from the | ||
| # original text. | ||
| chunk = text[start:chunk_end].lstrip() |
There was a problem hiding this comment.
Fixed in commit 16963ab (same fix as the line-156 thread). 10/10 tests still pass.
| def test_boundary_detection_prefers_sentence_terminator(): | ||
| """The chunk should end at a sentence boundary, not mid-sentence.""" | ||
| text = _build_text(50) | ||
| chunks = chunk_text_with_overlap(text, chunk_size=300, chunk_overlap=60) | ||
| # All but the final chunk should end with a sentence-terminal | ||
| # punctuation (".", "?", "!") or a paragraph break, never a partial word. | ||
| for chunk in chunks[:-1]: | ||
| stripped = chunk.rstrip() | ||
| assert stripped[-1] in ".!?", ( | ||
| f"Chunk should end at a sentence boundary, got: {stripped[-30:]!r}" | ||
| ) |
There was a problem hiding this comment.
Acknowledged. Adding a regression test for the boundary before naive_end case is a test-suite change, not a docstring change, so I'm leaving it for a follow-up PR per the docstring-only scope of this round. Will open a separate PR with the additional test once this one lands. The current 10-test suite still passes.
|
@FireFistisDead please review the pr and let me know if any changes are needed, if not please merge |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
rag-service/test_pdf_extractor_chunking.py (1)
47-54: 💤 Low valueOptional: use
itertools.pairwisefor the consecutive-pair iteration.This sidesteps the Ruff B905 warning (
zipwithoutstrict=) cleanly — note thatstrict=Truewould be wrong here since the two operands intentionally differ in length.♻️ Proposed tweak
+from itertools import pairwise + - for prev, nxt in zip(chunks, chunks[1:]): + for prev, nxt in pairwise(chunks):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rag-service/test_pdf_extractor_chunking.py` around lines 47 - 54, Replace the manual consecutive-pair iteration using zip(chunks, chunks[1:]) with itertools.pairwise to avoid the Ruff B905 warning; in the test that iterates over chunks (variables chunks, overlap, tail) use pairwise(chunks) so the loop semantics (comparing prev and nxt, computing tail = prev[-overlap:].rstrip(), and asserting tail appears in nxt) remain identical while eliminating the zip-with-slice pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rag-service/crawler/pdf_extractor.py`:
- Around line 170-190: The implementation around naive_end/window_end currently
extends chunks forward (using compiled pattern.finditer over
text[naive_end:window_end] and picking the rightmost match), so chunk_size acts
as a floor not a cap; either update the documentation/comments to state that
chunks may reach chunk_size + chunk_size // 2, or enforce a hard cap by changing
the boundary search to scan backward in the window [start, naive_end] and pick
the latest match ≤ naive_end (instead of scanning forward from naive_end to
window_end). Locate the logic using variables chunk_size, naive_end, window_end
and the loop over compiled patterns and apply the chosen fix consistently (doc
string/comments around target maximum and the matching loop behavior).
---
Nitpick comments:
In `@rag-service/test_pdf_extractor_chunking.py`:
- Around line 47-54: Replace the manual consecutive-pair iteration using
zip(chunks, chunks[1:]) with itertools.pairwise to avoid the Ruff B905 warning;
in the test that iterates over chunks (variables chunks, overlap, tail) use
pairwise(chunks) so the loop semantics (comparing prev and nxt, computing tail =
prev[-overlap:].rstrip(), and asserting tail appears in nxt) remain identical
while eliminating the zip-with-slice pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: df83c2f8-e744-4068-a086-090a7a125172
📒 Files selected for processing (2)
rag-service/crawler/pdf_extractor.pyrag-service/test_pdf_extractor_chunking.py
CodeRabbit review on PR FireFistisDead#462 noted that the boundary search extends the chunk end forward (not backward), so a chunk that finds a natural boundary in the lookahead can reach chunk_size + chunk_size // 2 characters. The previous docstring described chunk_size as a 'maximum' which is misleading. Update the docstring to: - State that chunk_size is a soft target, not a hard cap - Document the actual upper bound (chunk_size + chunk_size // 2) - Explain the trade-off: extending forward keeps the chunk start aligned with chunk_size on a regular cadence, while the end snaps to a natural break - Note the boundary_patterns arg's behavior in the matching language (was inconsistent with the 'rightmost-in-priority' description) No code change. No test change. All 12 tests still pass.
- Module-level boundary-search comment still described the rejected 'right-to-left walks' approach. Reworded to describe the actual forward-extension behaviour with the latest-match-in-lookahead tiebreaker, and point to the function docstring for the full sizing semantics. - Returns section said 'non-empty stripped' but the code only does an lstrip on the start. Tightened to spell out lstrip-vs-rstrip asymmetry and the reason (preserving the boundary used to align the chunk end for the next chunk's overlap window). No code change. 10/10 chunking tests still pass.
| def chunk_text_with_overlap( | ||
| text: str, | ||
| *, | ||
| chunk_size: int = 800, | ||
| chunk_overlap: int = 200, | ||
| boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS, | ||
| ) -> list[str]: |
| if not text or not text.strip(): | ||
| return [] |
| for prev, nxt in zip(chunks, chunks[1:]): | ||
| # At least `overlap` chars of the prev chunk's tail must appear | ||
| # in the next chunk. We tolerate trailing whitespace stripping. | ||
| tail = prev[-overlap:].rstrip() | ||
| assert tail[:20] in nxt, ( | ||
| f"Expected overlap tail to appear in next chunk.\n" | ||
| f"prev tail: {tail!r}\nnext head: {nxt[:120]!r}" | ||
| ) |
| # budget (chunk_size // 2) — that proves the algorithm didn't | ||
| # accidentally grow a single chunk to the full input length. | ||
| for chunk in chunks: | ||
| assert len(chunk) <= 200 + (200 // 2) + 16 # 16 = strip slack |
|
@FireFistisDead Please review |
|
@FireFistisDead Please review and let me know for any changes. |
What
Adds a new pure-text utility
chunk_text_with_overlapinrag-service/crawler/pdf_extractor.pythat splits input into chunks of roughlychunk_sizecharacters with a slidingchunk_overlapwindow, snapping each chunk's end to the latest natural boundary (paragraph break, sentence terminal) within achunk_size // 2lookahead.Why
The current pipeline hard-cuts text at fixed character offsets, slicing sentences and even words mid-token. That breaks retrieval quality because the embedding model sees partial sentences whose semantic content is lost. Boundary-aware chunking with overlap preserves cross-boundary context (a sentence that ends at the tail of one chunk is fully visible at the head of the next) while keeping each chunk close to the target size.
Scope
Purely additive. No existing call sites are modified, no behavior changes for current consumers, no breaking API. The new function lives next to the other PDF/text helpers in
pdf_extractor.pyand can be wired into the chunking pipeline behind a feature flag or A/B test before the legacy pass is retired.Design notes
\n{2,}) wins over sentence terminal ((?<=[.!?])\s+). The first pattern that yields a match in the lookahead wins.finditerand picks the latest match within thechunk_size // 2lookahead so the chunk stays close to the target size rather than collapsing at the first boundary.chunk_sizeand advances the cursor at least one character. This handles pathological inputs (no whitespace, no sentence terminals) without infinite loops.chunk_overlapcharacters back from the previous chunk's end, so consecutive chunks share a tail of approximatelychunk_overlapcharacters. The overlap is not re-snapped to a boundary — that would defeat the purpose of preserving the original tail.lstriponly on the start: the end of the chunk is preserved verbatim so the natural boundary (e.g. trailing\n\n) stays in the text. This matters for any downstream tokenizer or sentence-segmenter that expects real whitespace cues.Tests
10 new tests in
test_pdf_extractor_chunking.py:test_empty_input_returns_empty_list—None,"", whitespace-only inputs return[].test_short_input_returns_single_chunk— inputs shorter thanchunk_sizereturn a single chunk.test_long_input_is_split_into_multiple_chunks— long input is split, chunks stay near target size.test_consecutive_chunks_share_overlap_region— overlap region is present in both adjacent chunks.test_boundary_detection_prefers_sentence_terminator— single-period lookahead is honored.test_paragraph_breaks_take_priority_over_sentence_terminals— blank-line break wins over period.test_forward_progress_guaranteed_on_oversized_input— no-boundary input still terminates and advances.test_invalid_arguments_raise_value_error—chunk_size <= 0, negative overlap, overlap >= size all raise.test_custom_boundary_patterns_are_honored— caller-provided regex overrides defaults.test_no_overlap_when_chunk_overlap_is_zero— zero-overlap config produces disjoint chunks that reconstruct the input.All 10 pass. Existing
test_crawler.pystill passes (12/12 total inrag-service/).Out of scope
chunk_text_with_overlapinto the livemain.pychunking pipeline — this PR only adds the utility. A follow-up PR can gate it behind a feature flag and A/B test retrieval quality before the legacy pass is retired.Refs
Closes #247
Summary by CodeRabbit
New Features
Tests