Add Semantic Boundary Chunking Strategy by jlee600 · Pull Request #117 · georgia-tech-db/TokenSmith

jlee600 · 2026-04-16T00:11:43Z

The existing recursive chunker splits text at fixed character counts. This breaks sentences mid-thought and feeds fragmented context into the vector database, which caused the Qwen model to struggle with conceptual answers and hallucinate incorrect database properties.

Implementation

Added a SemanticBoundaryStrategy to chunk text based on meaning rather than character limits.
Integrated the all-MiniLM-L6-v2 encoder to compute cosine similarity between consecutive sentences.
The strategy splits only when it detects a topic shift, i.e. when the similarity between two consecutive sentences drops below the threshold, so that each chunk ends on a sentence boundary rather than mid-thought.
Defined fallback safety parameters in config.py: similarity_threshold (0.55) and min_chunk_size (100), the latter to prevent micro-chunks.
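
For reviewers, the splitting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `semantic_chunks`, `toy_embed`, and `cosine` are hypothetical names, the bag-of-words `toy_embed` is a dependency-free stand-in for the all-MiniLM-L6-v2 encoder, and treating min_chunk_size as a character count that suppresses a split while the current chunk is still too short is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

SIMILARITY_THRESHOLD = 0.55  # value quoted in the PR description
MIN_CHUNK_SIZE = 100         # value quoted in the PR; unit (characters) is an assumption

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in encoder: bag-of-words counts (NOT all-MiniLM-L6-v2)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = SIMILARITY_THRESHOLD,
    min_chunk_size: int = MIN_CHUNK_SIZE,
) -> List[str]:
    """Group consecutive sentences, splitting where similarity drops below threshold."""
    chunks: List[str] = []
    current: List[str] = []
    prev_vec: Vector = {}
    for sentence in sentences:
        vec = embed(sentence)
        if current:
            topic_shift = cosine(prev_vec, vec) < threshold
            long_enough = len(" ".join(current)) >= min_chunk_size
            # Assumed rule: split only on a topic shift AND once the chunk is big enough.
            if topic_shift and long_enough:
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because splits happen only between whole sentences, no chunk can end mid-sentence. With a real sentence encoder the vectors would be dense arrays and `cosine` would need adapting; the appropriate threshold also depends on the encoder, so 0.55 is not meaningful for the toy embedding.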

Benchmarks

Full textbook database successfully indexed in 42 minutes.
Evaluated against benchmark.yaml using the Semantic metric.
Pipeline average improved from 0.781 to 0.814.
Aggregation with Grouping: 0.616 -> 0.766 (+0.150)
SQL Isolation Guarantees: 0.778 -> 0.879 (+0.101)
ARIES Recovery: 0.872 -> 0.934 (+0.062)
Primary & Foreign Keys: 0.868 -> 0.927 (+0.059)
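
The parenthesized gains listed above match the absolute difference between the two scores (for example 0.766 - 0.616 = 0.150), not the relative change. A small sketch for checking both; the `deltas` helper and the `scores` dictionary are illustrative names and not part of the repository:

```python
from typing import Dict, Tuple

# Before/after scores copied from the benchmark list in this PR description.
scores: Dict[str, Tuple[float, float]] = {
    "Aggregation with Grouping": (0.616, 0.766),
    "SQL Isolation Guarantees": (0.778, 0.879),
    "ARIES Recovery": (0.872, 0.934),
    "Primary & Foreign Keys": (0.868, 0.927),
}


def deltas(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute difference, relative change in percent)."""
    absolute = round(after - before, 3)
    relative_pct = round((after - before) / before * 100, 1)
    return absolute, relative_pct


if __name__ == "__main__":
    for topic, (before, after) in scores.items():
        absolute, relative_pct = deltas(before, after)
        print(f"{topic}: {absolute:+.3f} absolute, {relative_pct:+.1f}% relative")
```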

jlee600 added 6 commits April 8, 2026 21:28

Add semantic boundary chunking

f13c708

add index dir name

e922513

Merge remote-tracking branch 'origin/main' into semantic-chunking

3858dff

Add min_chunk_size to semantic chunking

e857328

Merge branch 'main' into semantic-chunking

8ba5c94

ignore build_index propagated files

0b6f54d

jlee600 added the enhancement label (New feature or request) Apr 16, 2026

jlee600 wants to merge 6 commits into main from semantic-chunking