Add chunking and multi-query retrieval by stevez · Pull Request #2 · stevez/pdf-chatbot

stevez · 2026-04-10T05:55:07Z

Summary

- RecursiveCharacterTextSplitter in the ingestion pipeline: chunks PDFs into 1000-character segments with a 200-character overlap instead of storing whole pages
- Multi-query retrieval: generates 3 query variants via the LLM, retrieves docs for all queries in parallel, and deduplicates the results
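
To make the chunking parameters concrete, here is a minimal Python sketch of fixed-size chunking with overlap. It is a plain sliding window, not the RecursiveCharacterTextSplitter used in this PR (which also prefers to break on separators such as paragraphs and newlines), and `chunk_text` is a hypothetical name chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size chars; neighbours share chunk_overlap chars.

    Simplified stand-in for a real text splitter: it cuts at fixed offsets
    and ignores paragraph or sentence boundaries.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.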

Why

- Smaller chunks improve retrieval precision (less noise per chunk)
- Multiple query variants improve recall (they catch docs a single query would miss)
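
The retrieve-in-parallel-then-deduplicate step can be sketched as follows. This is an illustration under assumptions, not the PR's code: the query variants are taken as given (the LLM call that produces them is omitted), `multi_query_retrieve` and `fake_retrieve` are hypothetical names, documents are plain strings, and deduplication is by exact text match.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical retriever type: takes a query, returns matching chunk texts.
Retriever = Callable[[str], Awaitable[list[str]]]


async def multi_query_retrieve(queries: list[str], retrieve: Retriever) -> list[str]:
    """Run every query concurrently and merge the results without duplicates."""
    # gather() returns results in the same order as the input queries.
    per_query = await asyncio.gather(*(retrieve(q) for q in queries))

    seen: set[str] = set()
    unique: list[str] = []
    for docs in per_query:
        for doc in docs:
            if doc not in seen:  # keep only the first occurrence of each chunk
                seen.add(doc)
                unique.append(doc)
    return unique


# In-memory stand-in for a vector store, used only for this example.
FAKE_INDEX = {
    "how is text chunked": ["chunk A", "chunk B"],
    "what chunk size is used": ["chunk B", "chunk C"],
    "how are pages split": ["chunk A", "chunk D"],
}


async def fake_retrieve(query: str) -> list[str]:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_INDEX.get(query, [])
```

A chunk returned by several variants appears once in the merged list, at the position where it was first seen.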

Test plan

- Clear the Supabase documents table
- Re-upload a PDF and verify that the row count increased (more chunks)
- Ask a question and verify that the answer quality improves
- Backend unit tests pass (6/6)
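
For the row-count check, a rough estimate of how many chunks to expect can help. The sketch below assumes a plain fixed-size sliding window; RecursiveCharacterTextSplitter breaks on separators, so the real count will usually differ somewhat. `estimate_chunk_count` is a hypothetical helper, not part of this PR.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Approximate number of chunks a fixed-size sliding window would produce."""
    if text_length <= 0:
        return 0
    step = chunk_size - chunk_overlap  # net new characters covered per chunk
    return max(1, math.ceil((text_length - chunk_overlap) / step))
```

For example, a page of 2500 characters that previously produced one row would be expected to produce about three.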

🤖 Generated with Claude Code

* Add RecursiveCharacterTextSplitter and multi-query retrieval

  - Chunk documents (1000 chars, 200 overlap) before embedding for better retrieval precision
  - Generate multiple query variants using LLM to improve recall
  - Retrieve docs for all queries in parallel and deduplicate results

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Cap multi-query results and add expandable source citations

  - Limit retrieved docs to k*2 to reduce noise in sources
  - Click source cards to expand and view the actual chunk text

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

stevez and others added 2 commits April 10, 2026 01:54

Cap multi-query results and add expandable source citations

b55ea63

- Limit retrieved docs to k*2 to reduce noise in sources
- Click source cards to expand and view the actual chunk text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
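
The k*2 cap described in this commit might look roughly like the following. This is a sketch under assumptions: the commit message does not say how the merged documents are ordered before truncation, so the example simply keeps the first k*2 entries of an already deduplicated list, and `cap_results` is a hypothetical name.

```python
def cap_results(docs: list[str], k: int) -> list[str]:
    """Keep at most k*2 documents from an already merged, deduplicated list."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return docs[: k * 2]
```

With k = 3, a merged list of eight chunks is cut to six, while a shorter list is returned unchanged.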

stevez merged commit 68bf809 into main Apr 10, 2026
1 check passed

stevez deleted the feat/multi-query-and-chunking branch April 10, 2026 06:16
