feat(rag): add chunk_text_with_overlap utility for adaptive text chunking by vipul674 · Pull Request #462 · FireFistisDead/pdf-qa-bot

vipul674 · 2026-06-01T18:19:46Z

What

Adds a new pure-text utility chunk_text_with_overlap in rag-service/crawler/pdf_extractor.py that splits input into chunks of roughly chunk_size characters with a sliding chunk_overlap window, snapping each chunk's end to the latest natural boundary (paragraph break, sentence terminal) within a chunk_size // 2 lookahead.

Why

The current pipeline hard-cuts text at fixed character offsets, slicing sentences and even words mid-token. That breaks retrieval quality because the embedding model sees partial sentences whose semantic content is lost. Boundary-aware chunking with overlap preserves cross-boundary context (a sentence that ends at the tail of one chunk is fully visible at the head of the next) while keeping each chunk close to the target size.

Scope

Purely additive. No existing call sites are modified, no behavior changes for current consumers, no breaking API. The new function lives next to the other PDF/text helpers in pdf_extractor.py and can be wired into the chunking pipeline behind a feature flag or A/B test before the legacy pass is retired.

Design notes

Boundary priority: paragraph break (\n{2,}) wins over sentence terminal ((?<=[.!?])\s+). The first pattern that yields a match in the lookahead wins.
Rightmost match: the algorithm walks finditer and picks the latest match within the chunk_size // 2 lookahead so the chunk stays close to the target size rather than collapsing at the first boundary.
Forward-progress guarantee: when no boundary is found in the lookahead, the algorithm falls back to a hard cut at exactly chunk_size and advances the cursor at least one character. This handles pathological inputs (no whitespace, no sentence terminals) without infinite loops.
Overlap: the next chunk starts chunk_overlap characters back from the previous chunk's end, so consecutive chunks share a tail of approximately chunk_overlap characters. The overlap is not re-snapped to a boundary — that would defeat the purpose of preserving the original tail.
lstrip only on the start: the end of the chunk is preserved verbatim so the natural boundary (e.g. trailing \n\n) stays in the text. This matters for any downstream tokenizer or sentence-segmenter that expects real whitespace cues.
Custom boundaries: callers can pass their own tuple of boundary patterns (e.g. markdown section breaks, code-fence delimiters) for domain-specific splitting.

Tests

10 new tests in test_pdf_extractor_chunking.py:

test_empty_input_returns_empty_list — None, "", whitespace-only inputs return [].
test_short_input_returns_single_chunk — inputs shorter than chunk_size return a single chunk.
test_long_input_is_split_into_multiple_chunks — long input is split, chunks stay near target size.
test_consecutive_chunks_share_overlap_region — overlap region is present in both adjacent chunks.
test_boundary_detection_prefers_sentence_terminator — single-period lookahead is honored.
test_paragraph_breaks_take_priority_over_sentence_terminals — blank-line break wins over period.
test_forward_progress_guaranteed_on_oversized_input — no-boundary input still terminates and advances.
test_invalid_arguments_raise_value_error — chunk_size <= 0, negative overlap, overlap >= size all raise.
test_custom_boundary_patterns_are_honored — caller-provided regex overrides defaults.
test_no_overlap_when_chunk_overlap_is_zero — zero-overlap config produces disjoint chunks that reconstruct the input.

All 10 pass. Existing test_crawler.py still passes (12/12 total in rag-service/).

Out of scope

Wiring chunk_text_with_overlap into the live main.py chunking pipeline — this PR only adds the utility. A follow-up PR can gate it behind a feature flag and A/B test retrieval quality before the legacy pass is retired.
Token-aware chunking (tiktoken, sentencepiece) — that requires a tokenizer dep and is a separate concern.
Streaming chunking for very large inputs — current implementation holds the full text in memory; PDFs are bounded so this is fine for now.

Refs

Closes #247

Summary by CodeRabbit

New Features
- Improved document processing with intelligent text chunking that prefers paragraph and sentence boundaries, supports configurable chunk size and overlap, avoids empty chunks, and guarantees forward progress on very long segments.
Tests
- Added comprehensive tests for empty/short input, multi-chunk splitting, overlap behavior, boundary priority, custom boundaries, invalid-argument handling, and reconstruction without overlap.

…king Adds a pure-text utility that splits input into chunks of roughly chunk_size characters with a sliding chunk_overlap window, snapping each chunk's end to the latest natural boundary (paragraph break, sentence terminal) within a chunk_size // 2 lookahead. The result is boundary-aware chunks that preserve cross-boundary context for the vector store while staying close to the target size. This is purely additive and does not modify any existing call sites; consumers can opt in by wiring the new function into the chunking pipeline behind a feature flag or A/B test before the legacy pass is retired. Refs FireFistisDead#247

vercel · 2026-06-01T18:19:50Z

@vipul674 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-06-01T18:19:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 98098460-080e-4d20-84e9-de3a6bb698c6

📥 Commits

Reviewing files that changed from the base of the PR and between f1795c0 and 16963ab.

📒 Files selected for processing (1)

rag-service/crawler/pdf_extractor.py

🚧 Files skipped from review as they are similar to previous changes (1)

rag-service/crawler/pdf_extractor.py

📝 Walkthrough

Walkthrough

This PR adds an adaptive text chunker chunk_text_with_overlap that splits text into overlapping character chunks using prioritized regex boundaries (paragraphs then sentence terminators) and a comprehensive pytest suite validating edge cases, overlap, boundary priority, custom patterns, and termination.

Changes

Text Chunking with Overlap Support

Layer / File(s)	Summary
Chunking algorithm and boundary patterns `rag-service/crawler/pdf_extractor.py`	Adds `re` import, defines `_DEFAULT_BOUNDARY_PATTERNS` (blank lines prioritized over sentence terminators), and implements `chunk_text_with_overlap` with input validation, regex pre-compilation, lookahead-based boundary selection, overlap-aware advancement, and forward-progress guarantees.
Comprehensive test coverage for chunking behavior `rag-service/test_pdf_extractor_chunking.py`	Test module with 10 test cases covering empty input and `None`, short/long text splitting, consecutive-chunk overlap verification, sentence vs. mid-sentence cuts, paragraph-priority, forward-progress on oversized input, invalid-argument checks, custom boundary pattern handling, and reconstruction when `chunk_overlap=0`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

level:intermediate

Poem

🐰 I nibble at text, one chunk at a time,
I favor blank lines and sentence-end rhyme.
Overlaps stitch meaning tight and neat,
No orphaned phrases left incomplete.
Hop—context preserved, a tidy feat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: addition of a new chunk_text_with_overlap utility function for adaptive text chunking with boundary awareness and overlap windowing.
Description check	✅ Passed	The description is comprehensive and well-structured, covering what changed, why it matters, design rationale, testing coverage, and scope boundaries, though it deviates from the template format by using custom sections.
Linked Issues check	✅ Passed	The implementation fully addresses all coding requirements from issue `#247`: sliding-window overlap, smart boundary detection (paragraph breaks prioritized over sentence terminals), adaptive chunk sizing, forward-progress guarantee, and custom boundary pattern support.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the stated objective: implementation of chunk_text_with_overlap utility and comprehensive unit tests. No unrelated modifications or pipeline integration changes present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an adaptive, boundary-aware text chunking helper to the PDF extractor and introduces unit tests to validate chunk sizing, overlap behavior, boundary preference, and input validation.

Changes:

Added chunk_text_with_overlap() with paragraph/sentence boundary detection and overlap-based sliding windows.
Introduced default boundary regex patterns and argument validation for chunking.
Added pytest unit tests covering common and edge-case chunking behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
rag-service/crawler/pdf_extractor.py	Implements boundary-aware chunking with overlap and default boundary patterns.
rag-service/test_pdf_extractor_chunking.py	Adds unit tests to verify chunking behavior, overlap, boundary selection, and validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

vipul674 · 2026-06-01T18:46:45Z

+            window_end = min(naive_end + lookahead_max, text_len)
+            # Walk boundary patterns in priority order; the first pattern
+            # that yields a match wins, but we always pick the latest
+            # match within the lookahead so the chunk stays close to
+            # `chunk_size` characters.
+            best_offset = -1
+            for pattern in compiled:
+                for match in pattern.finditer(text, naive_end, window_end):
+                    offset = match.end() - start
+                    if offset > best_offset:
+                        best_offset = offset
+                if best_offset > 0:
+                    break
+            if best_offset > 0:
+                chunk_end = start + best_offset


Acknowledged. Took the docstring-only fix in commit f1795c0. The implementation is unchanged — the boundary search still extends forward from naive_end, but the docstring now states this explicitly. A boundary at or before naive_end search would change behaviour (smaller chunks, more hard cuts at the start side) and is deferred to a follow-up PR per the maintainer's discretion. 12/12 tests still pass.

Fixed in commit 16963ab. The Returns section now spells out the lstrip-only behaviour and the rationale (preserving the boundary used to align the chunk end verbatim for the next chunk's overlap window). 10/10 tests still pass.

vipul674 · 2026-06-01T18:51:28Z

+def chunk_text_with_overlap(
+    text: str,
+    *,
+    chunk_size: int = 800,
+    chunk_overlap: int = 200,
+    boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS,
+) -> list[str]:


Acknowledged. Tightening the type annotation to Optional[str] is a public-API change (callers that relied on the implicit empty-check might pass None expecting an error rather than a silent []). Out of scope for the docstring-only round per the maintainer's discretion. Will open a follow-up PR if the maintainer wants to tighten the type. No code change in this PR.

vipul674 · 2026-06-01T18:51:31Z

+    Returns:
+        List of non-empty stripped text chunks. The first chunk starts
+        at the beginning of the input; the last chunk absorbs any
+        remaining text up to the end of the input.


Fixed in commit 16963ab. The Returns section now says lstrip (leading whitespace only) and explains the reason for preserving trailing boundary whitespace verbatim. 10/10 tests still pass.

vipul674 · 2026-06-01T18:51:33Z

+        # Stripping the end would erase the very signal we used to align
+        # the chunk and would also make overlapping tails diverge from the
+        # original text.
+        chunk = text[start:chunk_end].lstrip()


Fixed in commit 16963ab (same fix as the line-156 thread). 10/10 tests still pass.

vipul674 · 2026-06-01T18:51:36Z

+def test_boundary_detection_prefers_sentence_terminator():
+    """The chunk should end at a sentence boundary, not mid-sentence."""
+    text = _build_text(50)
+    chunks = chunk_text_with_overlap(text, chunk_size=300, chunk_overlap=60)
+    # All but the final chunk should end with a sentence-terminal
+    # punctuation (".", "?", "!") or a paragraph break, never a partial word.
+    for chunk in chunks[:-1]:
+        stripped = chunk.rstrip()
+        assert stripped[-1] in ".!?", (
+            f"Chunk should end at a sentence boundary, got: {stripped[-30:]!r}"
+        )


Acknowledged. Adding a regression test for the boundary before naive_end case is a test-suite change, not a docstring change, so I'm leaving it for a follow-up PR per the docstring-only scope of this round. Will open a separate PR with the additional test once this one lands. The current 10-test suite still passes.

vipul674 · 2026-06-01T18:22:29Z

@FireFistisDead please review the pr and let me know if any changes are needed, if not please merge

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

rag-service/test_pdf_extractor_chunking.py (1)
47-54: 💤 Low value

Optional: use itertools.pairwise for the consecutive-pair iteration.

This sidesteps the Ruff B905 warning (zip without strict=) cleanly — note that strict=True would be wrong here since the two operands intentionally differ in length.
♻️ Proposed tweak
+from itertools import pairwise
+
-    for prev, nxt in zip(chunks, chunks[1:]):
+    for prev, nxt in pairwise(chunks):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/test_pdf_extractor_chunking.py` around lines 47 - 54, Replace the
manual consecutive-pair iteration using zip(chunks, chunks[1:]) with
itertools.pairwise to avoid the Ruff B905 warning; in the test that iterates
over chunks (variables chunks, overlap, tail) use pairwise(chunks) so the loop
semantics (comparing prev and nxt, computing tail = prev[-overlap:].rstrip(),
and asserting tail appears in nxt) remain identical while eliminating the
zip-with-slice pattern.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/crawler/pdf_extractor.py`:
- Around line 170-190: The implementation around naive_end/window_end currently
extends chunks forward (using compiled pattern.finditer over
text[naive_end:window_end] and picking the rightmost match), so chunk_size acts
as a floor not a cap; either update the documentation/comments to state that
chunks may reach chunk_size + chunk_size // 2, or enforce a hard cap by changing
the boundary search to scan backward in the window [start, naive_end] and pick
the latest match ≤ naive_end (instead of scanning forward from naive_end to
window_end). Locate the logic using variables chunk_size, naive_end, window_end
and the loop over compiled patterns and apply the chosen fix consistently (doc
string/comments around target maximum and the matching loop behavior).

---

Nitpick comments:
In `@rag-service/test_pdf_extractor_chunking.py`:
- Around line 47-54: Replace the manual consecutive-pair iteration using
zip(chunks, chunks[1:]) with itertools.pairwise to avoid the Ruff B905 warning;
in the test that iterates over chunks (variables chunks, overlap, tail) use
pairwise(chunks) so the loop semantics (comparing prev and nxt, computing tail =
prev[-overlap:].rstrip(), and asserting tail appears in nxt) remain identical
while eliminating the zip-with-slice pattern.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: df83c2f8-e744-4068-a086-090a7a125172

📥 Commits

Reviewing files that changed from the base of the PR and between 055d3af and c5c6e25.

📒 Files selected for processing (2)

rag-service/crawler/pdf_extractor.py
rag-service/test_pdf_extractor_chunking.py

CodeRabbit review on PR FireFistisDead#462 noted that the boundary search extends the chunk end forward (not backward), so a chunk that finds a natural boundary in the lookahead can reach chunk_size + chunk_size // 2 characters. The previous docstring described chunk_size as a 'maximum' which is misleading. Update the docstring to: - State that chunk_size is a soft target, not a hard cap - Document the actual upper bound (chunk_size + chunk_size // 2) - Explain the trade-off: extending forward keeps the chunk start aligned with chunk_size on a regular cadence, while the end snaps to a natural break - Note the boundary_patterns arg's behavior in the matching language (was inconsistent with the 'rightmost-in-priority' description) No code change. No test change. All 12 tests still pass.

- Module-level boundary-search comment still described the rejected 'right-to-left walks' approach. Reworded to describe the actual forward-extension behaviour with the latest-match-in-lookahead tiebreaker, and point to the function docstring for the full sizing semantics. - Returns section said 'non-empty stripped' but the code only does an lstrip on the start. Tightened to spell out lstrip-vs-rstrip asymmetry and the reason (preserving the boundary used to align the chunk end for the next chunk's overlap window). No code change. 10/10 chunking tests still pass.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

+def chunk_text_with_overlap(
+    text: str,
+    *,
+    chunk_size: int = 800,
+    chunk_overlap: int = 200,
+    boundary_patterns: tuple[str, ...] = _DEFAULT_BOUNDARY_PATTERNS,
+) -> list[str]:


+    if not text or not text.strip():
+        return []


+    for prev, nxt in zip(chunks, chunks[1:]):
+        # At least `overlap` chars of the prev chunk's tail must appear
+        # in the next chunk. We tolerate trailing whitespace stripping.
+        tail = prev[-overlap:].rstrip()
+        assert tail[:20] in nxt, (
+            f"Expected overlap tail to appear in next chunk.\n"
+            f"prev tail: {tail!r}\nnext head:  {nxt[:120]!r}"
+        )


+    # budget (chunk_size // 2) — that proves the algorithm didn't
+    # accidentally grow a single chunk to the full input length.
+    for chunk in chunks:
+        assert len(chunk) <= 200 + (200 // 2) + 16  # 16 = strip slack


vipul674 · 2026-06-02T02:18:24Z

@FireFistisDead Please review

vipul674 · 2026-06-02T02:29:06Z

@FireFistisDead Please review and let me know for any changes.

Copilot AI review requested due to automatic review settings June 1, 2026 18:19

vipul674 mentioned this pull request Jun 1, 2026

[Feature]: Implement Adaptive Text Chunking with Overlap Windowing #247

Open

Copilot AI reviewed Jun 1, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread rag-service/crawler/pdf_extractor.py

vipul674 added 2 commits June 2, 2026 00:15

Copilot AI review requested due to automatic review settings June 1, 2026 18:50

Copilot AI reviewed Jun 1, 2026

View reviewed changes

Conversation

vipul674 commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Scope

Design notes

Tests

Out of scope

Refs

Summary by CodeRabbit

Uh oh!

vercel Bot commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

vipul674 commented Jun 1, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

vipul674 commented Jun 2, 2026

Uh oh!

vipul674 commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vipul674 commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading