Feat/vector retriever by seyoung4503 · Pull Request #211 · CausalInferenceLab/Lang2SQL

seyoung4503 · 2026-02-27T10:08:14Z

#️⃣ Issue Number

TBD

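📝 Summary

Why: Keyword-based search (BM25) alone cannot handle queries that are semantically similar but use different keywords, so this PR adds vector-based and hybrid retrieval.
How: Implements the retrieval pipeline end to end, including VectorRetriever, HybridRetriever (RRF), DocumentLoader, and Chunker, and establishes the from_chunks() API as the standard entry point. Exposes retriever injection in BaselineNL2SQL to integrate with the end-to-end pipeline.

💬 To Reviewers (optional)

Please check whether the design decision to make VectorRetriever.from_chunks() the public API entry point and remove IndexBuilder from the public API is appropriate.
Please also review whether HybridRetriever's RRF (Reciprocal Rank Fusion) merge logic and the splitter parameter naming (formerly document_chunker → splitter) are consistent with existing conventions.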

PR Checklist

reference) How to Code Review

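Thumbs up (👍): used when the reviewer wants to praise the reviewee's code.
Exclamation mark (❗): used when the reviewer requires the reviewee to change the code.
Question mark (❓): used when the reviewer wants to ask for the reviewee's opinion.
Pill (💊): used when the reviewer suggests an improvement to the reviewee's code that is not mandatory.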

- Add HybridRetriever: BM25 + vector search merged via Reciprocal Rank Fusion (k=60), over-fetching top_n*2 from each retriever
- Add HybridNL2SQL flow: HybridRetriever → SQLGenerator → SQLExecutor
- Clean up BaselineNL2SQL: remove retriever= injection and isinstance branch
- Export HybridRetriever, HybridNL2SQL from public API
- Add 8 unit tests (99 total)
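For reviewers less familiar with RRF, here is a minimal sketch of the merge step described in this commit. It is not the code from this PR: the function name `rrf_merge`, the use of plain string ids, and the tie-breaking rule are assumptions made for illustration. Only k=60 and the top_n*2 over-fetch come from the commit message.

```python
from collections import defaultdict


def rrf_merge(rankings, top_n, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. The function name and signature are hypothetical.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# With top_n=2, each retriever is over-fetched to top_n * 2 = 4 candidates.
bm25_hits = ["orders", "customers", "payments", "refunds"]
vector_hits = ["payments", "orders", "invoices", "customers"]
merged = rrf_merge([bm25_hits, vector_hits], top_n=2)
```

Because RRF uses ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized onto a common scale before merging.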

- Add DocumentLoaderPort protocol to core/ports.py
- Add MarkdownLoader: .md file/directory → list[TextDocument], stdlib only
- Add PlainTextLoader: .txt file/directory → list[TextDocument], stdlib only
- Add DirectoryLoader: extension-based dispatch across a directory tree
- Add PDFLoader (integrations): pymupdf optional, one doc per page
- Export new loaders from public API
- Add 8 unit tests (107 total)
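The extension-based dispatch mentioned for DirectoryLoader could look roughly like the sketch below. This is an illustration using only the standard library, not the loader added in this PR: `Doc` is a hypothetical stand-in for TextDocument, and `load_text_file`, `LOADERS`, and `load_directory` are made-up names.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Doc:
    """Hypothetical stand-in for TextDocument; the real fields may differ."""
    source: str
    text: str


def load_text_file(path: Path) -> list[Doc]:
    # One document per file, read as UTF-8.
    return [Doc(source=str(path), text=path.read_text(encoding="utf-8"))]


# Lowercase file suffix -> loader function (assumed mapping for illustration).
LOADERS = {".md": load_text_file, ".txt": load_text_file}


def load_directory(root, loaders=LOADERS) -> list[Doc]:
    """Walk the tree under root and dispatch each file by its extension."""
    docs = []
    for path in sorted(Path(root).rglob("*")):
        loader = loaders.get(path.suffix.lower())
        if path.is_file() and loader is not None:
            docs.extend(loader(path))  # files with unknown suffixes are skipped
    return docs
```

Keeping the mapping as a plain dict means an optional loader (for example a PDF one) can be registered only when its dependency is installed.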

…) internals
- from_chunks(): LangChain-style factory that takes pre-split IndexedChunks
- from_sources(): now delegates to from_chunks() internally; renames document_chunker → splitter parameter
- add(): signature changed from list[TextDocument] to list[IndexedChunk]; removes RuntimeError guard and _index_builder reference
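To make the from_chunks() / add() contract concrete, the toy class below mirrors the shape described in this commit: a factory that takes pre-split chunks, and an add() that accepts chunks rather than raw documents. Everything here is an assumption for illustration: `Chunk` stands in for IndexedChunk, `ToyVectorRetriever` and `toy_embed` are invented, and the real VectorRetriever signatures may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for IndexedChunk."""
    chunk_id: str
    text: str


def toy_embed(text):
    # Toy bag-of-letters vector; a real setup would call an embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorRetriever:
    def __init__(self, embed):
        self._embed = embed
        self._items = []  # list of (Chunk, vector) pairs

    @classmethod
    def from_chunks(cls, chunks, embed=toy_embed):
        """Factory: build a retriever from chunks that are already split."""
        retriever = cls(embed)
        retriever.add(chunks)
        return retriever

    def add(self, chunks):
        """Incrementally index more pre-split chunks (not raw documents)."""
        for chunk in chunks:
            self._items.append((chunk, self._embed(chunk.text)))

    def retrieve(self, query, top_n=1):
        q = self._embed(query)
        ranked = sorted(self._items, key=lambda item: -_cosine(q, item[1]))
        return [chunk for chunk, _ in ranked[:top_n]]


retriever = ToyVectorRetriever.from_chunks(
    [Chunk("c1", "orders table"), Chunk("c2", "zzz")]
)
retriever.add([Chunk("c3", "payment amounts")])
```

Since the retriever never splits documents itself, the caller chooses the chunker, which is the trade-off that makes a separate index-building step unnecessary in this sketch.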

- Replace IndexBuilder-based tests (8, 9, 10, 14) with from_chunks() equivalents
- Update test 15: remove _index_builder assertion
- Update test 16: add() now requires pre-split IndexedChunks
- Add 3 new tests: from_chunks_empty, from_chunks_mixed, from_chunks_add_incremental
- Add test for CatalogChunker.split() batch method

- Section 0/1: remove IndexBuilder mention, add .split() description and PDFLoader
- Section 2: update 'advanced control' to reference from_chunks() pattern
- Section 3C: rewrite explicit pipeline example using split() + from_chunks()
- Section 3D (new): DirectoryLoader → HybridNL2SQL direct pattern
- Section 3E (new): PDFLoader usage with DirectoryLoader
- Section 4: add constraint note for add() requiring pre-split IndexedChunks

seyoung4503 added 21 commits February 25, 2026 17:29

feat(core): add TextDocument, IndexedChunk, RetrievalResult types and…

642c501

… VectorStorePort

feat(components/retrieval): add VectorRetriever, IndexBuilder, Catalo…

0dbe74f

…gChunker, RecursiveCharacterChunker

feat(integrations): add InMemoryVectorStore, OpenAIEmbedding, Semanti…

87e6ccb

…cChunker

feat(generation): add context parameter to SQLGenerator

892c30f

feat(flows): expose retriever injection in BaselineNL2SQL

dc10509

chore: export new public API symbols in __init__.py

d88fca4

test(components/retrieval): add 16 unit tests for VectorRetriever

62d5f83

docs(tutorials): add vector-retriever tutorial

98af225

feat(chunker): add .split() batch method to CatalogChunker and Recurs…

bc8b9fa

…iveCharacterChunker LangChain-style convenience: split(list) → list[IndexedChunk] wraps the existing chunk() per-item method.

feat(chunking): add .split() batch method to SemanticChunker

c78ba93

Consistent with CatalogChunker and RecursiveCharacterChunker so all chunkers share the same split(docs) interface.

refactor(hybrid): rename document_chunker → splitter parameter

b9eda9f

Aligns HybridRetriever with the updated VectorRetriever.from_sources() API.

refactor(retrieval): remove IndexBuilder from public API

6daf36b

IndexBuilder is superseded by the from_chunks() explicit pipeline. Delete index_builder.py and remove all imports/exports from retrieval/__init__.py and the top-level __init__.py.

docs(complete-tutorial): update all sections for from_chunks() API

34d6a93

docs : setup sample docs added

09e1a6c

style: apply black formatting to unmodified source files

7d88ff9

style: apply black formatting to unmodified source files

dd723b8

seyoung4503 merged commit 068b70d into master Feb 27, 2026
1 check passed

seyoung4503 merged 21 commits into master from feat/vector-retriever
