fix: limit semantic chunk merge amplification#390
Conversation
|
@Namraa310806 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
Caution Review failedFailed to post review comments 📝 WalkthroughWalkthroughThe PR adds environment-configurable resource limits to the semantic chunk merge pipeline to prevent excessive embedding work on highly fragmented inputs. ChangesSemantic merge resource limits
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds guardrails to the semantic merge step to prevent excessive embedding work on highly-fragmented/adversarial inputs, along with regression tests covering the new “skip merge” paths.
Changes:
- Introduces configurable upper bounds for “tiny chunk” count and “merge candidate” count.
- Skips semantic merge (with warnings) when those bounds are exceeded.
- Adds unit tests validating normal merge behavior and both new skip conditions.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| rag-service/main.py | Adds two env-configured limits and early-return skip logic in _split_pass2 to bound semantic merge work. |
| rag-service/tests/test_semantic_merge_limits.py | Adds regression tests for normal merge and the two new resource-limit skip paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Keep semantic merge work bounded for adversarial inputs that fragment a page | ||
| # into a large number of tiny chunks. Normal PDFs stay on the merge path. | ||
| if len(tiny_indices) > SEMANTIC_CHUNK_MAX_TINY_CHUNKS: | ||
| logger.warning( | ||
| "Semantic merge skipped tiny_chunks=%s limit=%s", | ||
| len(tiny_indices), | ||
| SEMANTIC_CHUNK_MAX_TINY_CHUNKS, | ||
| ) | ||
| return list(raw_chunks) |
| if len(sorted_indices) > SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES: | ||
| logger.warning( | ||
| "Semantic merge skipped candidates=%s tiny_chunks=%s limit=%s", | ||
| len(sorted_indices), | ||
| len(tiny_indices), | ||
| SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES, | ||
| ) | ||
| return list(raw_chunks) |
| def _stub_heavy_deps(): | ||
| for name in [ | ||
| "torch", | ||
| "numpy", | ||
| "langchain_community", | ||
| "langchain_community.vectorstores", | ||
| "langchain_community.embeddings", | ||
| "transformers", | ||
| "rank_bm25", | ||
| "pdf_parse_worker", | ||
| ]: | ||
| if name not in sys.modules: | ||
| sys.modules[name] = types.ModuleType(name) |
| os.environ.setdefault("JWT_SECRET", "test-secret-for-ci") | ||
| os.environ.setdefault("INTERNAL_RAG_TOKEN", "test-secret") | ||
|
|
||
| _stub_heavy_deps() |
| sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..")) | ||
| import main # noqa: E402 |
Summary
This PR adds resource-safety guardrails to the semantic chunk merging pipeline to prevent excessive embedding generation during PDF ingestion.
Previously, a highly fragmented or adversarial PDF could produce a very large number of tiny chunks, causing
_split_pass2()to generate embeddings for a large set of merge candidates. This could unnecessarily increase CPU and memory consumption during document processing.Changes Made
Added limits for semantic merge processing:
SEMANTIC_CHUNK_MAX_TINY_CHUNKSSEMANTIC_CHUNK_MAX_MERGE_CANDIDATESAdded early-exit safeguards when configured limits are exceeded.
Prevented embedding generation for oversized merge candidate sets.
Preserved existing semantic merge behavior for normal-sized documents.
Added regression tests covering:
Related issue
Fixes: #357
Testing
Verified Scenarios
Checklist:
Screenshots / recordings
N/A – Backend-only performance and stability improvement.
Notes
This change is intentionally scoped to semantic chunk merging and does not modify:
The goal is to reduce the risk of CPU and memory amplification during ingestion while preserving normal document processing behavior.
Security
Summary by CodeRabbit
New Features
Tests