Title: [FEAT] Implement Hybrid Search Combining Dense Vector (ChromaDB) and Sparse Keyword (BM25)
Is your feature request related to a problem? Please describe.
Semantic search sometimes misses exact keyword matches (such as specific invoice numbers, technical serial codes, or medication dosages) because the embedding model places these strings close to general synonyms instead of treating them as literal tokens.
Describe the solution you'd like
Implement a hybrid search pipeline:
- Add rank_bm25 to the backend dependencies.
- During ingestion, construct a BM25 index over the document chunks.
- During retrieval, run both ChromaDB vector search and BM25 search.
- Fuse the two result lists with Reciprocal Rank Fusion (RRF), which combines each chunk's rank positions rather than the raw scores (the two score scales are not comparable), to produce a combined ranked list.
- Pass the top-ranked hybrid results to the Cross-Encoder reranker.
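
To make the fusion step concrete, here is a minimal sketch of RRF in plain Python. It is illustrative only and not taken from the existing codebase: the function name, the chunk IDs, and the `top_n` parameter are hypothetical, and `k = 60` is the commonly used default constant, assumed here without tuning. The inputs are two lists of chunk IDs, each already sorted best-first by its own retriever (ChromaDB and BM25).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,       # assumed smoothing constant; 60 is a common default
    top_n: int = 10,   # hypothetical cutoff before reranking
) -> List[str]:
    """Fuse several best-first rankings of chunk IDs into one list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a chunk missing from a ranking adds nothing.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical chunk IDs, as each retriever might return them.
dense_ranking = ["c3", "c1", "c7"]   # from the vector search
sparse_ranking = ["c9", "c3", "c1"]  # from the BM25 search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking], top_n=3)
print(fused)
```

Chunks found by both retrievers rise to the top, while a chunk found only by BM25 (for example, an exact serial-code match) still makes the fused list and can reach the Cross-Encoder reranker.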
Describe alternatives you've considered
Fine-tuning the embedding model. This is extremely resource-intensive and still does not guarantee exact-string match accuracy.
Additional Context
- GSSoC '26: Yes, I am participating in GirlScript Summer of Code and would like to build this.
- Level: Advanced
- Affected Files: backend/app/rag/retriever.py, backend/requirements.txt