Update main.py by notdekyy · Pull Request #460 · FireFistisDead/pdf-qa-bot

notdekyy · 2026-06-01T17:37:16Z

Summary

What changed?

Implemented a robust, crash-proof AdaptiveTextSplitter utility class inside the backend routing file (rag-service/main.py) and swapped the naive token/character slicing loop inside the /process-pdf controller route. Incoming PDF strings are now split based on intelligent sentence-boundary analysis combined with a structural sliding window setup.

Why?

The previous splitting layout cut raw document data blindly by character counts. This frequently fractured sentences mid-word or mid-paragraph. When these broken fragments were stored in the FAISS vector database and retrieved later, the LLM lost vital structural context, which led to lower-quality answers or context-refusal errors. Introducing a 200-character overlapping sliding window ensures complete concept preservation across vector retrieval boundaries.

Related issue

Closes #247

Testing

I ran the relevant checks locally
I verified the app still starts
I tested the affected flow end-to-end

Note on Testing: The algorithm was strictly executed and verified using a mock PDF text array. Confirmed that text blocks cleanly slide and overlap without word-mutilation or high-RAM hanging traps.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

N/A — This is a backend RAG preprocessing framework adjustment. No customer-facing UI elements were modified.

Notes

This feature parameters default to chunk_size=1000 and chunk_overlap=200 to maximize embedding alignment for standard text-based PDF processing.
The iteration pointer includes a safety look-forward condition to completely insulate the background runtime loop from infinite cycle locks and high-RAM server resource exhaustion.

Security

No sensitive data included

Summary by CodeRabbit

New Features
- PDF processing now uses adaptive, overlapping text segmentation that prefers paragraph and sentence boundaries for cleaner chunks.
- Generated chunks continue to include page and file metadata and still honor duplicate-content filtering.
Changes
- Previous semantic-merge and hierarchical parent-context grouping are no longer applied to newly generated chunks, which may affect content consolidation.

vercel · 2026-06-01T17:37:20Z

@notdekyy is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-06-01T17:37:29Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7a18d843-4a9f-4b54-ac79-33b66f3d5e88

📥 Commits

Reviewing files that changed from the base of the PR and between 0562155 and 7330333.

📒 Files selected for processing (1)

rag-service/main.py

🚧 Files skipped from review as they are similar to previous changes (1)

rag-service/main.py

📝 Walkthrough

Walkthrough

Adds AdaptiveTextSplitter (boundary-aware, overlapping chunks) and updates /process-pdf to call adaptive_splitter.split_text(page_text), converting returned segments into LangChain Documents while preserving de-duplication and per-chunk metadata.

Changes

Adaptive Text Chunking Feature

Layer / File(s)	Summary
AdaptiveTextSplitter class `rag-service/main.py`	New `AdaptiveTextSplitter` with configurable `chunk_size` and `chunk_overlap`. Iteratively advances through text, seeking "soft" boundaries (double-newline, newline, terminal punctuation) within a lookback window; falls back to hard size limit. Enforces overlap by moving next start index back by `chunk_overlap`. Returns list of non-empty stripped segments.
Integration into /process-pdf endpoint `rag-service/main.py`	Replaces prior `semantic_chunk(...)` pipeline with `adaptive_splitter.split_text(page_text)`. Directly constructs LangChain `Document` objects from split segments. Preserves existing de-duplication (`seen_content`) and chunk metadata shape (`document_id`, `filename`, `page`, `chunk_index`).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

FireFistisDead/pdf-qa-bot#207: Related to prior chunking changes that this PR replaces with the new AdaptiveTextSplitter integration.

Suggested labels

level:intermediate

Poem

🐰 I hop through text with patient feet,
I find soft breaks where paragraphs meet.
Overlaps hug context tight,
So answers keep their meaning right.
A tiny rabbit nods—chunking complete.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title "Update main.py" is vague and generic, failing to convey the specific purpose of introducing an AdaptiveTextSplitter for intelligent PDF text chunking.	Use a more descriptive title such as "Implement AdaptiveTextSplitter for intelligent PDF text chunking" or "Replace naive text splitting with adaptive boundary-aware chunking in /process-pdf".
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The PR description comprehensively covers the summary, related issue, testing checklist, and includes explanations of why the change was needed and security considerations.
Linked Issues check	✅ Passed	The implementation fully addresses all coding objectives from issue `#247`: implements AdaptiveTextSplitter with configurable overlap windowing, smart boundary detection, and integration with the /process-pdf endpoint.
Out of Scope Changes check	✅ Passed	All changes in the PR directly support the objective of implementing adaptive text chunking; no out-of-scope modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

rag-service/main.py (2)
68-107: 💤 Low value

Minor edge case when chunk_size < 100.

The boundary search at line 85 assumes end_idx >= 100. If a caller instantiates AdaptiveTextSplitter(chunk_size=50), the expression end_idx - 100 becomes negative, and line 90 could compute a negative boundary_pos, causing incorrect slicing.

With the default chunk_size=1000, this is not an issue. Consider adding a guard or documenting the minimum supported chunk_size.
🛡️ Defensive fix example
             if end_idx < text_len:
-                search_space = text[end_idx - 100 : end_idx]
+                lookback = min(100, end_idx)
+                search_space = text[end_idx - lookback : end_idx]
                 boundary_pos = -1
                 for marker in ["\n\n", "\n", ". ", "? ", "! "]:
                     pos = search_space.rfind(marker)
                     if pos != -1:
-                        boundary_pos = (end_idx - 100) + pos + len(marker)
+                        boundary_pos = (end_idx - lookback) + pos + len(marker)
                         break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 68 - 107, The split_text method in
AdaptiveTextSplitter can compute a negative slice when chunk_size < 100 because
search_space is built with end_idx - 100; fix by clamping the search start
before slicing and adjusting the boundary offset: compute search_start =
max(start_idx, end_idx - 100, 0), set search_space = text[search_start:end_idx],
and when setting boundary_pos use search_start + pos + len(marker) instead of
(end_idx - 100) + pos + len(marker); this guarantees no negative indices and
correct boundary calculations for small chunk_size values.
2650-2658: ⚖️ Poor tradeoff

Hierarchical chunking metadata is no longer attached to PDF chunks.

documents_for_upload() now constructs Document objects from AdaptiveTextSplitter with metadata containing only document_id, filename, page, and chunk_index, so small_chunk/parent_chunk are never set (even when SEMANTIC_CHUNK_HIERARCHICAL is enabled). format_context() therefore falls back to page_content when parent_chunk is absent—no runtime error, but the broader “parent window” context used for generation is lost.

semantic_chunk() still populates small_chunk and parent_chunk, but the PDF processing path appears to bypass it; test_document_id_lineage.py patches semantic_chunk, which may now be dead or less relevant, and there’s also a stale comment referencing metadata being set by semantic_chunk().

If hierarchical context is still desired, either populate parent_chunk/small_chunk in the new AdaptiveTextSplitter chunk construction (reuse _build_parent_context) or update/remove the hierarchical feature/docs/tests to match the new behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2650 - 2658, documents_for_upload()
currently builds Document metadata with only document_id, filename, page and
chunk_index, dropping hierarchical fields so format_context() can't access
parent_chunk/small_chunk; to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled
have documents_for_upload() populate metadata.small_chunk and
metadata.parent_chunk by invoking the existing _build_parent_context (or reuse
the logic from semantic_chunk()) for the current chunk before creating Document,
referencing AdaptiveTextSplitter chunk outputs and the Document constructor;
alternatively, if hierarchical context is deprecated, remove/adjust
SEMANTIC_CHUNK_HIERARCHICAL usage and update tests (e.g.,
test_document_id_lineage.py) and stale comments to match the new behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/main.py`:
- Around line 2643-2644: The variable adaptive_splitter is referenced by
adaptive_splitter.split_text in the /process-pdf handler but never instantiated;
create an instance of your project's text-splitter (e.g., adaptive_splitter =
AdaptiveTextSplitter(...) or the appropriate constructor used elsewhere) before
the loop that calls split_text so adaptive_splitter is defined when used. Place
this instantiation just before the loop/processing block in the /process-pdf
route where split_text is called and ensure any required constructor args (chunk
size, overlap, etc.) match other uses.

---

Nitpick comments:
In `@rag-service/main.py`:
- Around line 68-107: The split_text method in AdaptiveTextSplitter can compute
a negative slice when chunk_size < 100 because search_space is built with
end_idx - 100; fix by clamping the search start before slicing and adjusting the
boundary offset: compute search_start = max(start_idx, end_idx - 100, 0), set
search_space = text[search_start:end_idx], and when setting boundary_pos use
search_start + pos + len(marker) instead of (end_idx - 100) + pos + len(marker);
this guarantees no negative indices and correct boundary calculations for small
chunk_size values.
- Around line 2650-2658: documents_for_upload() currently builds Document
metadata with only document_id, filename, page and chunk_index, dropping
hierarchical fields so format_context() can't access parent_chunk/small_chunk;
to fix, when SEMANTIC_CHUNK_HIERARCHICAL is enabled have documents_for_upload()
populate metadata.small_chunk and metadata.parent_chunk by invoking the existing
_build_parent_context (or reuse the logic from semantic_chunk()) for the current
chunk before creating Document, referencing AdaptiveTextSplitter chunk outputs
and the Document constructor; alternatively, if hierarchical context is
deprecated, remove/adjust SEMANTIC_CHUNK_HIERARCHICAL usage and update tests
(e.g., test_document_id_lineage.py) and stale comments to match the new
behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0eb970a6-cf91-4c62-8c83-2c2ac1abe84c

📥 Commits

Reviewing files that changed from the base of the PR and between 055d3af and d470511.

📒 Files selected for processing (1)

rag-service/main.py

Update main.py

d470511

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread rag-service/main.py

Update main.py

0562155

github-actions Bot added the duplicate This issue or pull request already exists label Jun 1, 2026

Update main.py

7330333

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update main.py#460

Update main.py#460
notdekyy wants to merge 3 commits into
FireFistisDead:masterfrom
notdekyy:feature/pdf-adaptive-chunking

notdekyy commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

notdekyy commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed?

Why?

Related issue

Testing

Checklist:

Screenshots / recordings

Notes

Security

Summary by CodeRabbit

Uh oh!

vercel Bot commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

notdekyy commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading