fix: limit semantic chunk merge amplification by Namraa310806 · Pull Request #390 · FireFistisDead/pdf-qa-bot

Namraa310806 · 2026-05-30T19:52:27Z

Summary

This PR adds resource-safety guardrails to the semantic chunk merging pipeline to prevent excessive embedding generation during PDF ingestion.

Previously, a highly fragmented or adversarial PDF could produce a very large number of tiny chunks, causing _split_pass2() to generate embeddings for a large set of merge candidates. This could unnecessarily increase CPU and memory consumption during document processing.

Changes Made

Added limits for semantic merge processing:
- SEMANTIC_CHUNK_MAX_TINY_CHUNKS
- SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES
Added early-exit safeguards when configured limits are exceeded.
Prevented embedding generation for oversized merge candidate sets.
Preserved existing semantic merge behavior for normal-sized documents.
Added regression tests covering:
- normal semantic merge behavior
- tiny chunk limit enforcement
- merge candidate limit enforcement
- verification that embeddings are not generated when limits are exceeded

Related issue

Fixes: #357

Testing

I ran the relevant checks locally
I verified the app still starts
I tested the affected flow end-to-end

Verified Scenarios

Normal document ingestion continues to perform semantic chunk merging.
Small chunk sets continue to merge correctly.
Excessive tiny chunk counts trigger safe fallback behavior.
Excessive merge candidate counts skip semantic merging.
Embedding generation is avoided when configured limits are exceeded.
Existing PDF processing functionality remains operational.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

N/A – Backend-only performance and stability improvement.

Notes

This change is intentionally scoped to semantic chunk merging and does not modify:

retrieval logic
embeddings model behavior
vectorstore persistence
session handling
answer generation

The goal is to reduce the risk of CPU and memory amplification during ingestion while preserving normal document processing behavior.

Security

No sensitive data included

Summary by CodeRabbit

New Features
- Added configurable limits to the semantic chunking pipeline to avoid excessive computation; merge step is skipped when thresholds are exceeded and a warning is logged.
- Exposed two environment variables to tune those thresholds.
Tests
- Added tests validating normal merging and the two limit-triggered skip behaviors.

vercel · 2026-05-30T19:52:32Z

@Namraa310806 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-30T19:52:37Z

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

The PR adds environment-configurable resource limits to the semantic chunk merge pipeline to prevent excessive embedding work on highly fragmented inputs. _split_pass2 now skips embedding-based merges and returns raw_chunks when tiny-chunk or merge-candidate counts exceed configured thresholds. A new test module verifies normal merging and both skip paths.

Changes

Semantic merge resource limits

Layer / File(s)	Summary
Configuration and limit enforcement `rag-service/main.py`	Adds `SEMANTIC_CHUNK_MAX_TINY_CHUNKS` and `SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES` (env-loaded) and early-return checks in `_split_pass2` that log a warning and return `raw_chunks` when either threshold is exceeded.
Test validation `rag-service/tests/test_semantic_merge_limits.py`	New tests stub heavy deps, set env, provide `_FakeEmbeddingModel`, and assert `_split_pass2` merges for small inputs and skips (without calling embeddings) when tiny-chunk or candidate limits are exceeded.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

FireFistisDead/pdf-qa-bot#207: Introduces the 2-phase semantic merge pipeline in _split_pass2; this PR adds resource-limit safeguards to the same pass-2 implementation.

Suggested labels

bug

Poem

🐰 A fragmented PDF hopped in with glee,
I counted the chunks beneath a tall tree.
If tiny bits swarm or candidates climb,
I skip the big embed — save CPU time.
Tests cheer, and the service stays snappy as me!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: limit semantic chunk merge amplification' clearly and concisely summarizes the main change—adding limits to prevent resource exhaustion during semantic chunk merging.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, covering summary, related issue, testing verification, checklist items, and security considerations aligned with the template.
Linked Issues check	✅ Passed	The PR successfully addresses all primary objectives from issue `#357`: added configurable limits (SEMANTIC_CHUNK_MAX_TINY_CHUNKS, SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES), implemented early-exit safeguards, prevented embedding generation when limits exceeded, and preserved normal semantic merge behavior.
Out of Scope Changes check	✅ Passed	All changes are scoped to semantic chunk merging in main.py and new tests; no modifications to retrieval logic, embeddings behavior, vectorstore persistence, session handling, or answer generation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds guardrails to the semantic merge step to prevent excessive embedding work on highly-fragmented/adversarial inputs, along with regression tests covering the new “skip merge” paths.

Changes:

Introduces configurable upper bounds for “tiny chunk” count and “merge candidate” count.
Skips semantic merge (with warnings) when those bounds are exceeded.
Adds unit tests validating normal merge behavior and both new skip conditions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
rag-service/main.py	Adds two env-configured limits and early-return skip logic in `_split_pass2` to bound semantic merge work.
rag-service/tests/test_semantic_merge_limits.py	Adds regression tests for normal merge and the two new resource-limit skip paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # Keep semantic merge work bounded for adversarial inputs that fragment a page
+    # into a large number of tiny chunks. Normal PDFs stay on the merge path.
+    if len(tiny_indices) > SEMANTIC_CHUNK_MAX_TINY_CHUNKS:
+        logger.warning(
+            "Semantic merge skipped tiny_chunks=%s limit=%s",
+            len(tiny_indices),
+            SEMANTIC_CHUNK_MAX_TINY_CHUNKS,
+        )
+        return list(raw_chunks)


+    if len(sorted_indices) > SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES:
+        logger.warning(
+            "Semantic merge skipped candidates=%s tiny_chunks=%s limit=%s",
+            len(sorted_indices),
+            len(tiny_indices),
+            SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES,
+        )
+        return list(raw_chunks)


+def _stub_heavy_deps():
+    for name in [
+        "torch",
+        "numpy",
+        "langchain_community",
+        "langchain_community.vectorstores",
+        "langchain_community.embeddings",
+        "transformers",
+        "rank_bm25",
+        "pdf_parse_worker",
+    ]:
+        if name not in sys.modules:
+            sys.modules[name] = types.ModuleType(name)


+os.environ.setdefault("JWT_SECRET", "test-secret-for-ci")
+os.environ.setdefault("INTERNAL_RAG_TOKEN", "test-secret")
+
+_stub_heavy_deps()


+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+import main  # noqa: E402


…mits

fix: limit semantic chunk merge amplification

dcd7dd9

Copilot AI review requested due to automatic review settings May 30, 2026 19:52

Copilot AI reviewed May 30, 2026

View reviewed changes

github-actions Bot added the level:intermediate label May 30, 2026

Merge branch 'FireFistisDead:master' into fix/semantic-chunk-merge-li…

461892a

…mits

Namraa310806 mentioned this pull request Jun 1, 2026

fix: standardize SSE streaming across frontend and backend #405

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: limit semantic chunk merge amplification#390

fix: limit semantic chunk merge amplification#390
Namraa310806 wants to merge 2 commits into
FireFistisDead:masterfrom
Namraa310806:fix/semantic-chunk-merge-limits

Namraa310806 commented May 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
		import main # noqa: E402

Conversation

Namraa310806 commented May 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes Made

Related issue

Testing

Verified Scenarios

Checklist:

Screenshots / recordings

Notes

Security

Summary by CodeRabbit

Uh oh!

vercel Bot commented May 30, 2026

Uh oh!

coderabbitai Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Namraa310806 commented May 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 30, 2026 •

edited

Loading