Update main.py#460
Conversation
|
@notdekyy is attempting to deploy a commit to the firefistisdead's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds AdaptiveTextSplitter (boundary-aware, overlapping chunks) and updates /process-pdf to call adaptive_splitter.split_text(page_text), converting returned segments into LangChain Documents while preserving de-duplication and per-chunk metadata. ChangesAdaptive Text Chunking Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
rag-service/main.py (2)
68-107: 💤 Low valueMinor edge case when
chunk_size < 100.The boundary search at line 85 assumes
end_idx >= 100. If a caller instantiatesAdaptiveTextSplitter(chunk_size=50), the expressionend_idx - 100becomes negative, and line 90 could compute a negativeboundary_pos, causing incorrect slicing.With the default
chunk_size=1000, this is not an issue. Consider adding a guard or documenting the minimum supportedchunk_size.🛡️ Defensive fix example
if end_idx < text_len: - search_space = text[end_idx - 100 : end_idx] + lookback = min(100, end_idx) + search_space = text[end_idx - lookback : end_idx] boundary_pos = -1 for marker in ["\n\n", "\n", ". ", "? ", "! "]: pos = search_space.rfind(marker) if pos != -1: - boundary_pos = (end_idx - 100) + pos + len(marker) + boundary_pos = (end_idx - lookback) + pos + len(marker) break🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rag-service/main.py` around lines 68 - 107, The split_text method in AdaptiveTextSplitter can compute a negative slice when chunk_size < 100 because search_space is built with end_idx - 100; fix by clamping the search start before slicing and adjusting the boundary offset: compute search_start = max(start_idx, end_idx - 100, 0), set search_space = text[search_start:end_idx], and when setting boundary_pos use search_start + pos + len(marker) instead of (end_idx - 100) + pos + len(marker); this guarantees no negative indices and correct boundary calculations for small chunk_size values.
2650-2658: ⚖️ Poor tradeoffHierarchical chunking metadata is no longer attached to PDF chunks.
documents_for_upload()now constructsDocumentobjects fromAdaptiveTextSplitterwith metadata containing onlydocument_id,filename,page, andchunk_index, sosmall_chunk/parent_chunkare never set (even whenSEMANTIC_CHUNK_HIERARCHICALis enabled).format_context()therefore falls back topage_contentwhenparent_chunkis absent—no runtime error, but the broader “parent window” context used for generation is lost.
semantic_chunk()still populatessmall_chunkandparent_chunk, but the PDF processing path appears to bypass it;test_document_id_lineage.pypatchessemantic_chunk, which may now be dead or less relevant, and there’s also a stale comment referencing metadata being set bysemantic_chunk().If hierarchical context is still desired, either populate
parent_chunk/small_chunkin the newAdaptiveTextSplitterchunk construction (reuse_build_parent_context) or update/remove the hierarchical feature/docs/tests to match the new behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rag-service/main.py` around lines 2650 - 2658, documents_for_upload() currently builds Document metadata with only document_id, filename, page and chunk_index, dropping hierarchical fields so format_context() can't access parent_chunk/small_chunk; to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled have documents_for_upload() populate metadata.small_chunk and metadata.parent_chunk by invoking the existing _build_parent_context (or reuse the logic from semantic_chunk()) for the current chunk before creating Document, referencing AdaptiveTextSplitter chunk outputs and the Document constructor; alternatively, if hierarchical context is deprecated, remove/adjust SEMANTIC_CHUNK_HIERARCHICAL usage and update tests (e.g., test_document_id_lineage.py) and stale comments to match the new behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rag-service/main.py`:
- Around line 2643-2644: The variable adaptive_splitter is referenced by
adaptive_splitter.split_text in the /process-pdf handler but never instantiated;
create an instance of your project's text-splitter (e.g., adaptive_splitter =
AdaptiveTextSplitter(...) or the appropriate constructor used elsewhere) before
the loop that calls split_text so adaptive_splitter is defined when used. Place
this instantiation just before the loop/processing block in the /process-pdf
route where split_text is called and ensure any required constructor args (chunk
size, overlap, etc.) match other uses.
---
Nitpick comments:
In `@rag-service/main.py`:
- Around line 68-107: The split_text method in AdaptiveTextSplitter can compute
a negative slice when chunk_size < 100 because search_space is built with
end_idx - 100; fix by clamping the search start before slicing and adjusting the
boundary offset: compute search_start = max(start_idx, end_idx - 100, 0), set
search_space = text[search_start:end_idx], and when setting boundary_pos use
search_start + pos + len(marker) instead of (end_idx - 100) + pos + len(marker);
this guarantees no negative indices and correct boundary calculations for small
chunk_size values.
- Around line 2650-2658: documents_for_upload() currently builds Document
metadata with only document_id, filename, page and chunk_index, dropping
hierarchical fields so format_context() can't access parent_chunk/small_chunk;
to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled have documents_for_upload()
populate metadata.small_chunk and metadata.parent_chunk by invoking the existing
_build_parent_context (or reuse the logic from semantic_chunk()) for the current
chunk before creating Document, referencing AdaptiveTextSplitter chunk outputs
and the Document constructor; alternatively, if hierarchical context is
deprecated, remove/adjust SEMANTIC_CHUNK_HIERARCHICAL usage and update tests
(e.g., test_document_id_lineage.py) and stale comments to match the new
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
Summary
What changed?
Implemented a robust, crash-proof
AdaptiveTextSplitterutility class inside the backend routing file (rag-service/main.py) and swapped the naive token/character slicing loop inside the/process-pdfcontroller route. Incoming PDF strings are now split based on intelligent sentence-boundary analysis combined with a structural sliding window setup.Why?
The previous splitting layout cut raw document data blindly by character counts. This frequently fractured sentences mid-word or mid-paragraph. When these broken fragments were stored in the FAISS vector database and retrieved later, the LLM lost vital structural context, which led to lower-quality answers or context-refusal errors. Introducing a 200-character overlapping sliding window ensures complete concept preservation across vector retrieval boundaries.
Related issue
Closes #247
Testing
Note on Testing: The algorithm was strictly executed and verified using a mock PDF text array. Confirmed that text blocks cleanly slide and overlap without word-mutilation or high-RAM hanging traps.
Checklist:
Screenshots / recordings
N/A — This is a backend RAG preprocessing framework adjustment. No customer-facing UI elements were modified.
Notes
chunk_size=1000andchunk_overlap=200to maximize embedding alignment for standard text-based PDF processing.Security
Summary by CodeRabbit
New Features
Changes