[FEAT] Implement Hybrid Search Combining Dense Vector (ChromaDB) and Sparse Keyword (BM25) #565

@knoxiboy

Description

Title: [FEAT] Implement Hybrid Search Combining Dense Vector (ChromaDB) and Sparse Keyword (BM25)

Is your feature request related to a problem? Please describe.

Semantic search sometimes misses exact keyword matches (such as specific invoice numbers, technical serial codes, or medication dosages) because the embedding model places these tokens close to general synonyms in vector space.

Describe the solution you'd like

Implement a hybrid search pipeline:

  1. Add rank_bm25 to the backend dependencies.
  2. During ingestion, construct a BM25 index over the document chunks.
  3. During retrieval, run both ChromaDB vector search and BM25 search.
  4. Merge the two result lists using Reciprocal Rank Fusion (RRF), which combines each chunk's rank position in the two lists rather than the raw scores, to produce a single ranked list.
  5. Pass the top-ranked hybrid results to the Cross-Encoder reranker.
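
As a rough illustration of step 4, here is a minimal RRF sketch in plain Python. It assumes each retriever has already returned a list of unique chunk IDs ordered best-first; in the proposed pipeline those lists would come from the ChromaDB query and the BM25 index. The function name `reciprocal_rank_fusion`, the chunk IDs, and the constant `k = 60` (a commonly used default) are illustrative assumptions, not part of the existing codebase.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[str]:
    """Fuse several best-first ID lists into one ranked list.

    Each ID receives 1 / (k + rank) from every list it appears in
    (rank is 1-based), and the contributions are summed.
    Assumes IDs are unique within each individual list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)

    # Highest fused score first; ties are broken by ID so output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


# Hypothetical chunk IDs standing in for real retriever output.
dense_ranking = ["c3", "c1", "c7", "c2"]   # e.g. from the vector search
sparse_ranking = ["c7", "c3", "c9"]        # e.g. from the BM25 search

candidates = reciprocal_rank_fusion([dense_ranking, sparse_ranking], top_n=3)
print(candidates)
```

Because RRF only looks at rank positions, the cosine distances from the vector search and the BM25 scores never need to be normalized onto a common scale. The fused `candidates` list is what would be handed to the Cross-Encoder reranker in step 5.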

Describe alternatives you've considered

Fine-tuning the embedding model. This is resource-intensive, and it does not guarantee exact-string match accuracy.

Additional Context

  • GSSoC '26: Yes, I am participating in GirlScript Summer of Code and would like to build this.
  • Level: Advanced
  • Affected Files: backend/app/rag/retriever.py, backend/requirements.txt

Metadata

Labels: gssoc (GirlScript Summer of Code 2026 issue/PR)
