Skip to content

Update main.py#460

Open
notdekyy wants to merge 3 commits into
FireFistisDead:masterfrom
notdekyy:feature/pdf-adaptive-chunking
Open

Update main.py#460
notdekyy wants to merge 3 commits into
FireFistisDead:masterfrom
notdekyy:feature/pdf-adaptive-chunking

Conversation

@notdekyy
Copy link
Copy Markdown

@notdekyy notdekyy commented Jun 1, 2026

Summary

What changed?

Implemented a robust, crash-proof AdaptiveTextSplitter utility class inside the backend routing file (rag-service/main.py) and swapped the naive token/character slicing loop inside the /process-pdf controller route. Incoming PDF strings are now split based on intelligent sentence-boundary analysis combined with a structural sliding window setup.

Why?

The previous splitting layout cut raw document data blindly by character counts. This frequently fractured sentences mid-word or mid-paragraph. When these broken fragments were stored in the FAISS vector database and retrieved later, the LLM lost vital structural context, which led to lower-quality answers or context-refusal errors. Introducing a 200-character overlapping sliding window ensures complete concept preservation across vector retrieval boundaries.

Related issue

Closes #247

Testing

  • I ran the relevant checks locally
  • I verified the app still starts
  • I tested the affected flow end-to-end

Note on Testing: The algorithm was strictly executed and verified using a mock PDF text array. Confirmed that text blocks cleanly slide and overlap without word-mutilation or high-RAM hanging traps.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

N/A — This is a backend RAG preprocessing framework adjustment. No customer-facing UI elements were modified.

Notes

  • This feature parameters default to chunk_size=1000 and chunk_overlap=200 to maximize embedding alignment for standard text-based PDF processing.
  • The iteration pointer includes a safety look-forward condition to completely insulate the background runtime loop from infinite cycle locks and high-RAM server resource exhaustion.

Security

  • No sensitive data included

Summary by CodeRabbit

  • New Features

    • PDF processing now uses adaptive, overlapping text segmentation that prefers paragraph and sentence boundaries for cleaner chunks.
    • Generated chunks continue to include page and file metadata and still honor duplicate-content filtering.
  • Changes

    • Previous semantic-merge and hierarchical parent-context grouping are no longer applied to newly generated chunks, which may affect content consolidation.

@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 1, 2026

@notdekyy is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7a18d843-4a9f-4b54-ac79-33b66f3d5e88

📥 Commits

Reviewing files that changed from the base of the PR and between 0562155 and 7330333.

📒 Files selected for processing (1)
  • rag-service/main.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • rag-service/main.py

📝 Walkthrough

Walkthrough

Adds AdaptiveTextSplitter (boundary-aware, overlapping chunks) and updates /process-pdf to call adaptive_splitter.split_text(page_text), converting returned segments into LangChain Documents while preserving de-duplication and per-chunk metadata.

Changes

Adaptive Text Chunking Feature

Layer / File(s) Summary
AdaptiveTextSplitter class
rag-service/main.py
New AdaptiveTextSplitter with configurable chunk_size and chunk_overlap. Iteratively advances through text, seeking "soft" boundaries (double-newline, newline, terminal punctuation) within a lookback window; falls back to hard size limit. Enforces overlap by moving next start index back by chunk_overlap. Returns list of non-empty stripped segments.
Integration into /process-pdf endpoint
rag-service/main.py
Replaces prior semantic_chunk(...) pipeline with adaptive_splitter.split_text(page_text). Directly constructs LangChain Document objects from split segments. Preserves existing de-duplication (seen_content) and chunk metadata shape (document_id, filename, page, chunk_index).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

level:intermediate

Poem

🐰 I hop through text with patient feet,
I find soft breaks where paragraphs meet.
Overlaps hug context tight,
So answers keep their meaning right.
A tiny rabbit nods—chunking complete.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title "Update main.py" is vague and generic, failing to convey the specific purpose of introducing an AdaptiveTextSplitter for intelligent PDF text chunking. Use a more descriptive title such as "Implement AdaptiveTextSplitter for intelligent PDF text chunking" or "Replace naive text splitting with adaptive boundary-aware chunking in /process-pdf".
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The PR description comprehensively covers the summary, related issue, testing checklist, and includes explanations of why the change was needed and security considerations.
Linked Issues check ✅ Passed The implementation fully addresses all coding objectives from issue #247: implements AdaptiveTextSplitter with configurable overlap windowing, smart boundary detection, and integration with the /process-pdf endpoint.
Out of Scope Changes check ✅ Passed All changes in the PR directly support the objective of implementing adaptive text chunking; no out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend Express or API gateway work bug Something isn't working docs Documentation only feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing labels Jun 1, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
rag-service/main.py (2)

68-107: 💤 Low value

Minor edge case when chunk_size < 100.

The boundary search at line 85 assumes end_idx >= 100. If a caller instantiates AdaptiveTextSplitter(chunk_size=50), the expression end_idx - 100 becomes negative, and line 90 could compute a negative boundary_pos, causing incorrect slicing.

With the default chunk_size=1000, this is not an issue. Consider adding a guard or documenting the minimum supported chunk_size.

🛡️ Defensive fix example
             if end_idx < text_len:
-                search_space = text[end_idx - 100 : end_idx]
+                lookback = min(100, end_idx)
+                search_space = text[end_idx - lookback : end_idx]
                 boundary_pos = -1
                 for marker in ["\n\n", "\n", ". ", "? ", "! "]:
                     pos = search_space.rfind(marker)
                     if pos != -1:
-                        boundary_pos = (end_idx - 100) + pos + len(marker)
+                        boundary_pos = (end_idx - lookback) + pos + len(marker)
                         break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 68 - 107, The split_text method in
AdaptiveTextSplitter can compute a negative slice when chunk_size < 100 because
search_space is built with end_idx - 100; fix by clamping the search start
before slicing and adjusting the boundary offset: compute search_start =
max(start_idx, end_idx - 100, 0), set search_space = text[search_start:end_idx],
and when setting boundary_pos use search_start + pos + len(marker) instead of
(end_idx - 100) + pos + len(marker); this guarantees no negative indices and
correct boundary calculations for small chunk_size values.

2650-2658: ⚖️ Poor tradeoff

Hierarchical chunking metadata is no longer attached to PDF chunks.

documents_for_upload() now constructs Document objects from AdaptiveTextSplitter with metadata containing only document_id, filename, page, and chunk_index, so small_chunk/parent_chunk are never set (even when SEMANTIC_CHUNK_HIERARCHICAL is enabled). format_context() therefore falls back to page_content when parent_chunk is absent—no runtime error, but the broader “parent window” context used for generation is lost.

semantic_chunk() still populates small_chunk and parent_chunk, but the PDF processing path appears to bypass it; test_document_id_lineage.py patches semantic_chunk, which may now be dead or less relevant, and there’s also a stale comment referencing metadata being set by semantic_chunk().

If hierarchical context is still desired, either populate parent_chunk/small_chunk in the new AdaptiveTextSplitter chunk construction (reuse _build_parent_context) or update/remove the hierarchical feature/docs/tests to match the new behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2650 - 2658, documents_for_upload()
currently builds Document metadata with only document_id, filename, page and
chunk_index, dropping hierarchical fields so format_context() can't access
parent_chunk/small_chunk; to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled
have documents_for_upload() populate metadata.small_chunk and
metadata.parent_chunk by invoking the existing _build_parent_context (or reuse
the logic from semantic_chunk()) for the current chunk before creating Document,
referencing AdaptiveTextSplitter chunk outputs and the Document constructor;
alternatively, if hierarchical context is deprecated, remove/adjust
SEMANTIC_CHUNK_HIERARCHICAL usage and update tests (e.g.,
test_document_id_lineage.py) and stale comments to match the new behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/main.py`:
- Around line 2643-2644: The variable adaptive_splitter is referenced by
adaptive_splitter.split_text in the /process-pdf handler but never instantiated;
create an instance of your project's text-splitter (e.g., adaptive_splitter =
AdaptiveTextSplitter(...) or the appropriate constructor used elsewhere) before
the loop that calls split_text so adaptive_splitter is defined when used. Place
this instantiation just before the loop/processing block in the /process-pdf
route where split_text is called and ensure any required constructor args (chunk
size, overlap, etc.) match other uses.

---

Nitpick comments:
In `@rag-service/main.py`:
- Around line 68-107: The split_text method in AdaptiveTextSplitter can compute
a negative slice when chunk_size < 100 because search_space is built with
end_idx - 100; fix by clamping the search start before slicing and adjusting the
boundary offset: compute search_start = max(start_idx, end_idx - 100, 0), set
search_space = text[search_start:end_idx], and when setting boundary_pos use
search_start + pos + len(marker) instead of (end_idx - 100) + pos + len(marker);
this guarantees no negative indices and correct boundary calculations for small
chunk_size values.
- Around line 2650-2658: documents_for_upload() currently builds Document
metadata with only document_id, filename, page and chunk_index, dropping
hierarchical fields so format_context() can't access parent_chunk/small_chunk;
to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled have documents_for_upload()
populate metadata.small_chunk and metadata.parent_chunk by invoking the existing
_build_parent_context (or reuse the logic from semantic_chunk()) for the current
chunk before creating Document, referencing AdaptiveTextSplitter chunk outputs
and the Document constructor; alternatively, if hierarchical context is
deprecated, remove/adjust SEMANTIC_CHUNK_HIERARCHICAL usage and update tests
(e.g., test_document_id_lineage.py) and stale comments to match the new
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0eb970a6-cf91-4c62-8c83-2c2ac1abe84c

📥 Commits

Reviewing files that changed from the base of the PR and between 055d3af and d470511.

📒 Files selected for processing (1)
  • rag-service/main.py

Comment thread rag-service/main.py
@github-actions github-actions Bot added the duplicate This issue or pull request already exists label Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Express or API gateway work bug Something isn't working docs Documentation only duplicate This issue or pull request already exists feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Implement Adaptive Text Chunking with Overlap Windowing

1 participant