
[BUGFIX] : Leaking Soft-Deleted Documents in Global Chat RAG Retrieval #517

Open
hrshjswniii wants to merge 3 commits into param20h:dev from hrshjswniii:bugfix/Leaking-Documents

Conversation

@hrshjswniii

Contributor

🔗 Related Issue


Closes #486



📝 What does this PR do?


This PR fixes a data-isolation (security) bug in which soft-deleted documents (rows with is_deleted = True in SQLite) leaked into global, unscoped chatbot retrieval:

  • Active Document ID Retrieval: Modified the global retrieve function in retriever.py to check the database for active document IDs. If no specific document_id is supplied, it queries for a whitelist of all non-deleted (is_deleted=False) document IDs belonging to the querying user.
  • ChromaDB Filtering: Passed the active document ID whitelist into CustomVectorRetriever as document_ids, which filters vector store queries via where_filter = {"document_id": {"$in": document_ids}}.
  • BM25 Filtering: Passed the active document ID whitelist into CustomBM25Retriever as document_ids and updated query_bm25 in bm25.py to skip index .pkl files that do not match the active list.
  • Unit Testing: Added test_retrieve_excludes_soft_deleted_documents to test_retriever.py to assert that retrieval ignores soft-deleted documents under general/global chat scopes.
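
For reviewers who want the idea in one place, here is a minimal, self-contained sketch of the whitelist approach described above. It is not the code from this PR: the `documents` table and its `id`, `user_id`, and `is_deleted` columns, the helper names, and the one-`<document_id>.pkl`-per-document naming are all assumptions made for illustration. Only the shape of the `where_filter` dictionary is taken from the description.

```python
import sqlite3
from pathlib import Path


def active_document_ids(conn, user_id):
    """Return IDs of the user's documents that are not soft-deleted."""
    # Assumed schema: documents(id, user_id, is_deleted). Names are hypothetical.
    rows = conn.execute(
        "SELECT id FROM documents WHERE user_id = ? AND is_deleted = 0 ORDER BY id",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]


def build_where_filter(document_ids):
    """Build the metadata filter from the PR description, or None if nothing is active."""
    if not document_ids:
        # With no active documents the caller should return no results,
        # rather than fall back to an unfiltered query.
        return None
    return {"document_id": {"$in": list(document_ids)}}


def select_bm25_index_files(index_files, document_ids):
    """Keep only index files whose stem matches an active document ID."""
    # Assumes one '<document_id>.pkl' file per document (hypothetical naming).
    allowed = set(document_ids)
    return [path for path in index_files if Path(path).stem in allowed]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id TEXT PRIMARY KEY, user_id TEXT, is_deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [("doc-a", "u1", 0), ("doc-b", "u1", 1), ("doc-c", "u2", 0)],
    )
    ids = active_document_ids(conn, "u1")
    print(ids)
    print(build_where_filter(ids))
    print(select_bm25_index_files(["idx/doc-a.pkl", "idx/doc-b.pkl"], ids))
```

The empty-whitelist branch is a deliberate design choice in this sketch: a user whose documents are all soft-deleted should get no retrieval hits, so the caller is expected to short-circuit instead of issuing a query with an empty `$in` list or with no filter at all.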


🗂️ Type of Change


  • 🐛 Bug fix
  • 🧪 Tests


🧪 How was this tested?


  • Tested the affected API endpoints manually
  • Added / updated tests (Created test_retrieve_excludes_soft_deleted_documents in backend/tests/test_retriever.py)
  • Ran full backend test suite (.venv/Scripts/pytest — all 120 tests passed)


⚠️ Anything to flag for reviewers?

  • None.


✅ Self-Review Checklist


  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@hrshjswniii hrshjswniii requested a review from param20h as a code owner June 7, 2026 14:10
@hrshjswniii hrshjswniii force-pushed the bugfix/Leaking-Documents branch from d058744 to 0ac2bcb Compare June 7, 2026 14:12
@hrshjswniii hrshjswniii changed the title from "Bugfix/leaking documents" to "[BUGFIX] : Leaking Soft-Deleted Documents in Global Chat RAG Retrieval" Jun 7, 2026
@param20h

param20h commented Jun 7, 2026

Owner

Merge conflicts, @hrshjswniii. Resolve them and ping me.

@hrshjswniii

Contributor Author

Hi @param20h, the merge conflicts have been resolved.
Please review, and feel free to merge the PR!


Development

Successfully merging this pull request may close these issues.

[BUG] : Leaking Soft-Deleted Documents in Global Chat RAG Retrieval

2 participants