Add Semantic Boundary Chunking Strategy #117

Open
jlee600 wants to merge 6 commits into main from semantic-chunking

jlee600 commented Apr 16, 2026

The existing recursive chunker splits text at hard character counts. This breaks sentences mid-thought and feeds fragmented context into the vector database, which caused the Qwen model to struggle to answer conceptual questions and to hallucinate incorrect database properties.

Implementation

  • Added a SemanticBoundaryStrategy that chunks text based on meaning rather than character limits.
  • Integrated the all-MiniLM-L6-v2 encoder to compute cosine similarity between consecutive sentences.
  • The strategy splits only where the similarity between consecutive sentences drops below the threshold, which signals a topic shift, so every chunk contains complete, unbroken sentences.
  • Defined fallback safety parameters in config.py: similarity_threshold (0.55) and min_chunk_size (100), the latter to prevent micro-chunks.
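
The splitting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`semantic_chunks`, `split_sentences`, `cosine`), the regex sentence splitter, and the reading of min_chunk_size as a character count are all assumptions. The encoder is passed in as a plain callable so the sketch runs without downloading a model.

```python
import math
import re
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(
    text: str,
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    similarity_threshold: float = 0.55,  # value from the PR description
    min_chunk_size: int = 100,           # value from the PR; unit assumed to be characters
) -> List[str]:
    """Group sentences into chunks, splitting only at low-similarity boundaries."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks: List[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine(vectors[i - 1], vectors[i])
        long_enough = len(" ".join(current)) >= min_chunk_size
        # Split only on a topic shift, and only if the current chunk
        # is already large enough to avoid micro-chunks.
        if similarity < similarity_threshold and long_enough:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

With the sentence-transformers package installed, `embed` could be something like `SentenceTransformer("all-MiniLM-L6-v2").encode`, which accepts a list of strings and returns one vector per sentence. Because splits happen only between sentences, no chunk ever ends mid-sentence, whatever the threshold.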

Benchmarks

  • Full textbook database successfully indexed in 42 minutes.
  • Evaluated against benchmark.yaml using the Semantic metric.
  • Pipeline average improved from 0.781 to 0.814 (+0.033).
  • Aggregation with Grouping: 0.616 -> 0.766 (+0.150)
  • SQL Isolation Guarantees: 0.778 -> 0.879 (+0.101)
  • ARIES Recovery: 0.872 -> 0.934 (+0.062)
  • Primary & Foreign Keys: 0.868 -> 0.927 (+0.059)
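
The deltas listed above are absolute differences in score (for example, 0.766 - 0.616 = 0.150), not relative gains. A small sketch for recomputing both from the reported before/after values; the helper name `deltas` and the dictionary layout are illustrative only:

```python
# Before/after scores copied from the benchmark list above.
scores = {
    "Pipeline average": (0.781, 0.814),
    "Aggregation with Grouping": (0.616, 0.766),
    "SQL Isolation Guarantees": (0.778, 0.879),
    "ARIES Recovery": (0.872, 0.934),
    "Primary & Foreign Keys": (0.868, 0.927),
}


def deltas(before: float, after: float) -> tuple:
    """Return (absolute change, relative change in percent)."""
    absolute = round(after - before, 3)
    relative_pct = round((after - before) / before * 100, 1)
    return absolute, relative_pct


for name, (before, after) in scores.items():
    absolute, relative_pct = deltas(before, after)
    print(f"{name}: {before:.3f} -> {after:.3f} "
          f"(absolute {absolute:+.3f}, relative {relative_pct:+.1f}%)")
```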

jlee600 added the enhancement (New feature or request) label Apr 16, 2026