Skip to content

feat(rag): add chunk_text_with_overlap utility for adaptive text chunking#462

Open
vipul674 wants to merge 3 commits into
FireFistisDead:masterfrom
vipul674:fix/issue-247-adaptive-chunking-upstream
Open

feat(rag): add chunk_text_with_overlap utility for adaptive text chunking#462
vipul674 wants to merge 3 commits into
FireFistisDead:masterfrom
vipul674:fix/issue-247-adaptive-chunking-upstream

Conversation

@vipul674
Copy link
Copy Markdown

@vipul674 vipul674 commented Jun 1, 2026

What

Adds a new pure-text utility chunk_text_with_overlap in rag-service/crawler/pdf_extractor.py that splits input into chunks of roughly chunk_size characters with a sliding chunk_overlap window, snapping each chunk's end to the latest natural boundary (paragraph break, sentence terminal) within a chunk_size // 2 lookahead.

Why

The current pipeline hard-cuts text at fixed character offsets, slicing sentences and even words mid-token. That breaks retrieval quality because the embedding model sees partial sentences whose semantic content is lost. Boundary-aware chunking with overlap preserves cross-boundary context (a sentence that ends at the tail of one chunk is fully visible at the head of the next) while keeping each chunk close to the target size.

Scope

Purely additive. No existing call sites are modified, no behavior changes for current consumers, no breaking API. The new function lives next to the other PDF/text helpers in pdf_extractor.py and can be wired into the chunking pipeline behind a feature flag or A/B test before the legacy pass is retired.

Design notes

  • Boundary priority: paragraph break (\n{2,}) wins over sentence terminal ((?<=[.!?])\s+). The first pattern that yields a match in the lookahead wins.
  • Rightmost match: the algorithm walks finditer and picks the latest match within the chunk_size // 2 lookahead so the chunk stays close to the target size rather than collapsing at the first boundary.
  • Forward-progress guarantee: when no boundary is found in the lookahead, the algorithm falls back to a hard cut at exactly chunk_size and advances the cursor at least one character. This handles pathological inputs (no whitespace, no sentence terminals) without infinite loops.
  • Overlap: the next chunk starts chunk_overlap characters back from the previous chunk's end, so consecutive chunks share a tail of approximately chunk_overlap characters. The overlap is not re-snapped to a boundary — that would defeat the purpose of preserving the original tail.
  • lstrip only on the start: the end of the chunk is preserved verbatim so the natural boundary (e.g. trailing \n\n) stays in the text. This matters for any downstream tokenizer or sentence-segmenter that expects real whitespace cues.
  • Custom boundaries: callers can pass their own tuple of boundary patterns (e.g. markdown section breaks, code-fence delimiters) for domain-specific splitting.

Tests

10 new tests in test_pdf_extractor_chunking.py:

  • test_empty_input_returns_empty_listNone, "", whitespace-only inputs return [].
  • test_short_input_returns_single_chunk — inputs shorter than chunk_size return a single chunk.
  • test_long_input_is_split_into_multiple_chunks — long input is split, chunks stay near target size.
  • test_consecutive_chunks_share_overlap_region — overlap region is present in both adjacent chunks.
  • test_boundary_detection_prefers_sentence_terminator — single-period lookahead is honored.
  • test_paragraph_breaks_take_priority_over_sentence_terminals — blank-line break wins over period.
  • test_forward_progress_guaranteed_on_oversized_input — no-boundary input still terminates and advances.
  • test_invalid_arguments_raise_value_errorchunk_size <= 0, negative overlap, overlap >= size all raise.
  • test_custom_boundary_patterns_are_honored — caller-provided regex overrides defaults.
  • test_no_overlap_when_chunk_overlap_is_zero — zero-overlap config produces disjoint chunks that reconstruct the input.

All 10 pass. Existing test_crawler.py still passes (12/12 total in rag-service/).

Out of scope

  • Wiring chunk_text_with_overlap into the live main.py chunking pipeline — this PR only adds the utility. A follow-up PR can gate it behind a feature flag and A/B test retrieval quality before the legacy pass is retired.
  • Token-aware chunking (tiktoken, sentencepiece) — that requires a tokenizer dep and is a separate concern.
  • Streaming chunking for very large inputs — current implementation holds the full text in memory; PDFs are bounded so this is fine for now.

Refs

Closes #247

Summary by CodeRabbit

  • New Features

    • Improved document processing with intelligent text chunking that prefers paragraph and sentence boundaries, supports configurable chunk size and overlap, avoids empty chunks, and guarantees forward progress on very long segments.
  • Tests

    • Added comprehensive tests for empty/short input, multi-chunk splitting, overlap behavior, boundary priority, custom boundaries, invalid-argument handling, and reconstruction without overlap.

…king

Adds a pure-text utility that splits input into chunks of roughly
chunk_size characters with a sliding chunk_overlap window, snapping
each chunk's end to the latest natural boundary (paragraph break,
sentence terminal) within a chunk_size // 2 lookahead. The result is
boundary-aware chunks that preserve cross-boundary context for the
vector store while staying close to the target size.

This is purely additive and does not modify any existing call sites;
consumers can opt in by wiring the new function into the chunking
pipeline behind a feature flag or A/B test before the legacy pass is
retired.

Refs FireFistisDead#247
Copilot AI review requested due to automatic review settings June 1, 2026 18:19
@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 1, 2026

@vipul674 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 98098460-080e-4d20-84e9-de3a6bb698c6

📥 Commits

Reviewing files that changed from the base of the PR and between f1795c0 and 16963ab.

📒 Files selected for processing (1)
  • rag-service/crawler/pdf_extractor.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • rag-service/crawler/pdf_extractor.py

📝 Walkthrough

Walkthrough

This PR adds an adaptive text chunker chunk_text_with_overlap that splits text into overlapping character chunks using prioritized regex boundaries (paragraphs then sentence terminators) and a comprehensive pytest suite validating edge cases, overlap, boundary priority, custom patterns, and termination.

Changes

Text Chunking with Overlap Support

Layer / File(s) Summary
Chunking algorithm and boundary patterns
rag-service/crawler/pdf_extractor.py
Adds re import, defines _DEFAULT_BOUNDARY_PATTERNS (blank lines prioritized over sentence terminators), and implements chunk_text_with_overlap with input validation, regex pre-compilation, lookahead-based boundary selection, overlap-aware advancement, and forward-progress guarantees.
Comprehensive test coverage for chunking behavior
rag-service/test_pdf_extractor_chunking.py
Test module with 10 test cases covering empty input and None, short/long text splitting, consecutive-chunk overlap verification, sentence vs. mid-sentence cuts, paragraph-priority, forward-progress on oversized input, invalid-argument checks, custom boundary pattern handling, and reconstruction when chunk_overlap=0.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

level:intermediate

Poem

🐰 I nibble at text, one chunk at a time,
I favor blank lines and sentence-end rhyme.
Overlaps stitch meaning tight and neat,
No orphaned phrases left incomplete.
Hop—context preserved, a tidy feat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: addition of a new chunk_text_with_overlap utility function for adaptive text chunking with boundary awareness and overlap windowing.
Description check ✅ Passed The description is comprehensive and well-structured, covering what changed, why it matters, design rationale, testing coverage, and scope boundaries, though it deviates from the template format by using custom sections.
Linked Issues check ✅ Passed The implementation fully addresses all coding requirements from issue #247: sliding-window overlap, smart boundary detection (paragraph breaks prioritized over sentence terminals), adaptive chunk sizing, forward-progress guarantee, and custom boundary pattern support.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objective: implementation of chunk_text_with_overlap utility and comprehensive unit tests. No unrelated modifications or pipeline integration changes present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend Express or API gateway work bug Something isn't working feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work invalid This doesn't seem right level:advanced rag-service FastAPI / model service work type:testing labels Jun 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an adaptive, boundary-aware text chunking helper to the PDF extractor and introduces unit tests to validate chunk sizing, overlap behavior, boundary preference, and input validation.

Changes:

  • Added chunk_text_with_overlap() with paragraph/sentence boundary detection and overlap-based sliding windows.
  • Introduced default boundary regex patterns and argument validation for chunking.
  • Added pytest unit tests covering common and edge-case chunking behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
rag-service/crawler/pdf_extractor.py Implements boundary-aware chunking with overlap and default boundary patterns.
rag-service/test_pdf_extractor_chunking.py Adds unit tests to verify chunking behavior, overlap, boundary selection, and validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +176 to +190
window_end = min(naive_end + lookahead_max, text_len)
# Walk boundary patterns in priority order; the first pattern
# that yields a match wins, but we always pick the latest
# match within the lookahead so the chunk stays close to
# `chunk_size` characters.
best_offset = -1
for pattern in compiled:
for match in pattern.finditer(text, naive_end, window_end):
offset = match.end() - start
if offset > best_offset:
best_offset = offset
if best_offset > 0:
break
if best_offset > 0:
chunk_end = start + best_offset
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Took the docstring-only fix in commit f1795c0. The implementation is unchanged — the boundary search still extends forward from naive_end, but the docstring now states this explicitly. A boundary at or before naive_end search would change behaviour (smaller chunks, more hard cuts at the start side) and is deferred to a follow-up PR per the maintainer's discretion. 12/12 tests still pass.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 16963ab. The Returns section now spells out the lstrip-only behaviour and the rationale (preserving the boundary used to align the chunk end verbatim for the next chunk's overlap window). 10/10 tests still pass.

Comment on lines +96 to +102
def chunk_text_with_overlap(
text: str,
*,
chunk_size: int = 800,
chunk_overlap: int = 200,
boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS,
) -> list[str]:
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Tightening the type annotation to Optional[str] is a public-API change (callers that relied on the implicit empty-check might pass None expecting an error rather than a silent []). Out of scope for the docstring-only round per the maintainer's discretion. Will open a follow-up PR if the maintainer wants to tighten the type. No code change in this PR.

Comment on lines +133 to +136
Returns:
List of non-empty stripped text chunks. The first chunk starts
at the beginning of the input; the last chunk absorbs any
remaining text up to the end of the input.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 16963ab. The Returns section now says lstrip (leading whitespace only) and explains the reason for preserving trailing boundary whitespace verbatim. 10/10 tests still pass.

# Stripping the end would erase the very signal we used to align
# the chunk and would also make overlapping tails diverge from the
# original text.
chunk = text[start:chunk_end].lstrip()
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 16963ab (same fix as the line-156 thread). 10/10 tests still pass.

Comment on lines +57 to +67
def test_boundary_detection_prefers_sentence_terminator():
"""The chunk should end at a sentence boundary, not mid-sentence."""
text = _build_text(50)
chunks = chunk_text_with_overlap(text, chunk_size=300, chunk_overlap=60)
# All but the final chunk should end with a sentence-terminal
# punctuation (".", "?", "!") or a paragraph break, never a partial word.
for chunk in chunks[:-1]:
stripped = chunk.rstrip()
assert stripped[-1] in ".!?", (
f"Chunk should end at a sentence boundary, got: {stripped[-30:]!r}"
)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Adding a regression test for the boundary before naive_end case is a test-suite change, not a docstring change, so I'm leaving it for a follow-up PR per the docstring-only scope of this round. Will open a separate PR with the additional test once this one lands. The current 10-test suite still passes.

@vipul674
Copy link
Copy Markdown
Author

vipul674 commented Jun 1, 2026

@FireFistisDead please review the pr and let me know if any changes are needed, if not please merge

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
rag-service/test_pdf_extractor_chunking.py (1)

47-54: 💤 Low value

Optional: use itertools.pairwise for the consecutive-pair iteration.

This sidesteps the Ruff B905 warning (zip without strict=) cleanly — note that strict=True would be wrong here since the two operands intentionally differ in length.

♻️ Proposed tweak
+from itertools import pairwise
+
-    for prev, nxt in zip(chunks, chunks[1:]):
+    for prev, nxt in pairwise(chunks):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/test_pdf_extractor_chunking.py` around lines 47 - 54, Replace the
manual consecutive-pair iteration using zip(chunks, chunks[1:]) with
itertools.pairwise to avoid the Ruff B905 warning; in the test that iterates
over chunks (variables chunks, overlap, tail) use pairwise(chunks) so the loop
semantics (comparing prev and nxt, computing tail = prev[-overlap:].rstrip(),
and asserting tail appears in nxt) remain identical while eliminating the
zip-with-slice pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/crawler/pdf_extractor.py`:
- Around line 170-190: The implementation around naive_end/window_end currently
extends chunks forward (using compiled pattern.finditer over
text[naive_end:window_end] and picking the rightmost match), so chunk_size acts
as a floor not a cap; either update the documentation/comments to state that
chunks may reach chunk_size + chunk_size // 2, or enforce a hard cap by changing
the boundary search to scan backward in the window [start, naive_end] and pick
the latest match ≤ naive_end (instead of scanning forward from naive_end to
window_end). Locate the logic using variables chunk_size, naive_end, window_end
and the loop over compiled patterns and apply the chosen fix consistently (doc
string/comments around target maximum and the matching loop behavior).

---

Nitpick comments:
In `@rag-service/test_pdf_extractor_chunking.py`:
- Around line 47-54: Replace the manual consecutive-pair iteration using
zip(chunks, chunks[1:]) with itertools.pairwise to avoid the Ruff B905 warning;
in the test that iterates over chunks (variables chunks, overlap, tail) use
pairwise(chunks) so the loop semantics (comparing prev and nxt, computing tail =
prev[-overlap:].rstrip(), and asserting tail appears in nxt) remain identical
while eliminating the zip-with-slice pattern.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: df83c2f8-e744-4068-a086-090a7a125172

📥 Commits

Reviewing files that changed from the base of the PR and between 055d3af and c5c6e25.

📒 Files selected for processing (2)
  • rag-service/crawler/pdf_extractor.py
  • rag-service/test_pdf_extractor_chunking.py

Comment thread rag-service/crawler/pdf_extractor.py
vipul674 added 2 commits June 2, 2026 00:15
CodeRabbit review on PR FireFistisDead#462 noted that the boundary search extends
the chunk end forward (not backward), so a chunk that finds a natural
boundary in the lookahead can reach chunk_size + chunk_size // 2
characters. The previous docstring described chunk_size as a
'maximum' which is misleading.

Update the docstring to:
- State that chunk_size is a soft target, not a hard cap
- Document the actual upper bound (chunk_size + chunk_size // 2)
- Explain the trade-off: extending forward keeps the chunk start
  aligned with chunk_size on a regular cadence, while the end snaps
  to a natural break
- Note the boundary_patterns arg's behavior in the matching language
  (was inconsistent with the 'rightmost-in-priority' description)

No code change. No test change. All 12 tests still pass.
- Module-level boundary-search comment still described the rejected
  'right-to-left walks' approach. Reworded to describe the actual
  forward-extension behaviour with the latest-match-in-lookahead
  tiebreaker, and point to the function docstring for the full
  sizing semantics.
- Returns section said 'non-empty stripped' but the code only does
  an lstrip on the start. Tightened to spell out lstrip-vs-rstrip
  asymmetry and the reason (preserving the boundary used to align
  the chunk end for the next chunk's overlap window).

No code change. 10/10 chunking tests still pass.
Copilot AI review requested due to automatic review settings June 1, 2026 18:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines +100 to +106
def chunk_text_with_overlap(
text: str,
*,
chunk_size: int = 800,
chunk_overlap: int = 200,
boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS,
) -> list[str]:
Comment on lines +171 to +172
if not text or not text.strip():
return []
Comment on lines +47 to +54
for prev, nxt in zip(chunks, chunks[1:]):
# At least `overlap` chars of the prev chunk's tail must appear
# in the next chunk. We tolerate trailing whitespace stripping.
tail = prev[-overlap:].rstrip()
assert tail[:20] in nxt, (
f"Expected overlap tail to appear in next chunk.\n"
f"prev tail: {tail!r}\nnext head: {nxt[:120]!r}"
)
# budget (chunk_size // 2) — that proves the algorithm didn't
# accidentally grow a single chunk to the full input length.
for chunk in chunks:
assert len(chunk) <= 200 + (200 // 2) + 16 # 16 = strip slack
@vipul674
Copy link
Copy Markdown
Author

vipul674 commented Jun 2, 2026

@FireFistisDead Please review

@vipul674
Copy link
Copy Markdown
Author

vipul674 commented Jun 2, 2026

@FireFistisDead Please review and let me know for any changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Express or API gateway work bug Something isn't working feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work invalid This doesn't seem right level:advanced rag-service FastAPI / model service work type:testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Implement Adaptive Text Chunking with Overlap Windowing

2 participants