Skip to content

fix: limit semantic chunk merge amplification#390

Open
Namraa310806 wants to merge 2 commits into
FireFistisDead:masterfrom
Namraa310806:fix/semantic-chunk-merge-limits
Open

fix: limit semantic chunk merge amplification#390
Namraa310806 wants to merge 2 commits into
FireFistisDead:masterfrom
Namraa310806:fix/semantic-chunk-merge-limits

Conversation

@Namraa310806
Copy link
Copy Markdown
Contributor

@Namraa310806 Namraa310806 commented May 30, 2026

Summary

This PR adds resource-safety guardrails to the semantic chunk merging pipeline to prevent excessive embedding generation during PDF ingestion.

Previously, a highly fragmented or adversarial PDF could produce a very large number of tiny chunks, causing _split_pass2() to generate embeddings for a large set of merge candidates. This could unnecessarily increase CPU and memory consumption during document processing.

Changes Made

  • Added limits for semantic merge processing:

    • SEMANTIC_CHUNK_MAX_TINY_CHUNKS
    • SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES
  • Added early-exit safeguards when configured limits are exceeded.

  • Prevented embedding generation for oversized merge candidate sets.

  • Preserved existing semantic merge behavior for normal-sized documents.

  • Added regression tests covering:

    • normal semantic merge behavior
    • tiny chunk limit enforcement
    • merge candidate limit enforcement
    • verification that embeddings are not generated when limits are exceeded

Related issue

Fixes: #357

Testing

  • I ran the relevant checks locally
  • I verified the app still starts
  • I tested the affected flow end-to-end

Verified Scenarios

  • Normal document ingestion continues to perform semantic chunk merging.
  • Small chunk sets continue to merge correctly.
  • Excessive tiny chunk counts trigger safe fallback behavior.
  • Excessive merge candidate counts skip semantic merging.
  • Embedding generation is avoided when configured limits are exceeded.
  • Existing PDF processing functionality remains operational.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

N/A – Backend-only performance and stability improvement.

Notes

This change is intentionally scoped to semantic chunk merging and does not modify:

  • retrieval logic
  • embeddings model behavior
  • vectorstore persistence
  • session handling
  • answer generation

The goal is to reduce the risk of CPU and memory amplification during ingestion while preserving normal document processing behavior.

Security

  • No sensitive data included

Summary by CodeRabbit

  • New Features

    • Added configurable limits to the semantic chunking pipeline to avoid excessive computation; merge step is skipped when thresholds are exceeded and a warning is logged.
    • Exposed two environment variables to tune those thresholds.
  • Tests

    • Added tests validating normal merging and the two limit-triggered skip behaviors.

Copilot AI review requested due to automatic review settings May 30, 2026 19:52
@vercel
Copy link
Copy Markdown

vercel Bot commented May 30, 2026

@Namraa310806 is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 30, 2026

Review Change Stack

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

The PR adds environment-configurable resource limits to the semantic chunk merge pipeline to prevent excessive embedding work on highly fragmented inputs. _split_pass2 now skips embedding-based merges and returns raw_chunks when tiny-chunk or merge-candidate counts exceed configured thresholds. A new test module verifies normal merging and both skip paths.

Changes

Semantic merge resource limits

Layer / File(s) Summary
Configuration and limit enforcement
rag-service/main.py
Adds SEMANTIC_CHUNK_MAX_TINY_CHUNKS and SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES (env-loaded) and early-return checks in _split_pass2 that log a warning and return raw_chunks when either threshold is exceeded.
Test validation
rag-service/tests/test_semantic_merge_limits.py
New tests stub heavy deps, set env, provide _FakeEmbeddingModel, and assert _split_pass2 merges for small inputs and skips (without calling embeddings) when tiny-chunk or candidate limits are exceeded.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • FireFistisDead/pdf-qa-bot#207: Introduces the 2-phase semantic merge pipeline in _split_pass2; this PR adds resource-limit safeguards to the same pass-2 implementation.

Suggested labels

bug

Poem

🐰 A fragmented PDF hopped in with glee,
I counted the chunks beneath a tall tree.
If tiny bits swarm or candidates climb,
I skip the big embed — save CPU time.
Tests cheer, and the service stays snappy as me!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: limit semantic chunk merge amplification' clearly and concisely summarizes the main change—adding limits to prevent resource exhaustion during semantic chunk merging.
Description check ✅ Passed The pull request description is comprehensive and well-structured, covering summary, related issue, testing verification, checklist items, and security considerations aligned with the template.
Linked Issues check ✅ Passed The PR successfully addresses all primary objectives from issue #357: added configurable limits (SEMANTIC_CHUNK_MAX_TINY_CHUNKS, SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES), implemented early-exit safeguards, prevented embedding generation when limits exceeded, and preserved normal semantic merge behavior.
Out of Scope Changes check ✅ Passed All changes are scoped to semantic chunk merging in main.py and new tests; no modifications to retrieval logic, embeddings behavior, vectorstore persistence, session handling, or answer generation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend Express or API gateway work docs Documentation only enhancement New feature or request feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing labels May 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds guardrails to the semantic merge step to prevent excessive embedding work on highly-fragmented/adversarial inputs, along with regression tests covering the new “skip merge” paths.

Changes:

  • Introduces configurable upper bounds for “tiny chunk” count and “merge candidate” count.
  • Skips semantic merge (with warnings) when those bounds are exceeded.
  • Adds unit tests validating normal merge behavior and both new skip conditions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
rag-service/main.py Adds two env-configured limits and early-return skip logic in _split_pass2 to bound semantic merge work.
rag-service/tests/test_semantic_merge_limits.py Adds regression tests for normal merge and the two new resource-limit skip paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rag-service/main.py
Comment on lines +2123 to +2131
# Keep semantic merge work bounded for adversarial inputs that fragment a page
# into a large number of tiny chunks. Normal PDFs stay on the merge path.
if len(tiny_indices) > SEMANTIC_CHUNK_MAX_TINY_CHUNKS:
logger.warning(
"Semantic merge skipped tiny_chunks=%s limit=%s",
len(tiny_indices),
SEMANTIC_CHUNK_MAX_TINY_CHUNKS,
)
return list(raw_chunks)
Comment thread rag-service/main.py
Comment on lines +2144 to +2151
if len(sorted_indices) > SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES:
logger.warning(
"Semantic merge skipped candidates=%s tiny_chunks=%s limit=%s",
len(sorted_indices),
len(tiny_indices),
SEMANTIC_CHUNK_MAX_MERGE_CANDIDATES,
)
return list(raw_chunks)
Comment on lines +10 to +22
def _stub_heavy_deps():
for name in [
"torch",
"numpy",
"langchain_community",
"langchain_community.vectorstores",
"langchain_community.embeddings",
"transformers",
"rank_bm25",
"pdf_parse_worker",
]:
if name not in sys.modules:
sys.modules[name] = types.ModuleType(name)
Comment on lines +55 to +58
os.environ.setdefault("JWT_SECRET", "test-secret-for-ci")
os.environ.setdefault("INTERNAL_RAG_TOKEN", "test-secret")

_stub_heavy_deps()
Comment on lines +60 to +61
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
import main # noqa: E402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Express or API gateway work docs Documentation only enhancement New feature or request feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work level:advanced level:intermediate rag-service FastAPI / model service work type:security type:testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance]: Semantic chunk merge may cause CPU/memory exhaustion on crafted PDFs

2 participants