feat(rag): add extraction and indexing of PDF image captions #490

Merged
param20h merged 2 commits into param20h:dev from nancysangani:feat/pdf-image-caption-indexing on Jun 11, 2026

@nancysangani

Contributor

🔗 Related Issue

Closes #438


📝 What does this PR do?

Completes image caption extraction and indexing for the RAG pipeline across two files:

backend/app/rag/vision.py (rewrite of existing stub):

  • Adds extract_captions_from_pdf() — uses PyMuPDF bounding-box proximity to find
    the nearest text block below (or above) each figure. Zero cost, fully offline.
  • Fixes the broken OpenAI hook (openai.Image.create was an image-generation call,
    not a vision call) — replaced with gpt-4o-mini Chat Completions vision API,
    guarded behind VISION_PROVIDER=openai + OPENAI_API_KEY in settings.
  • Restructures caption_image() resolution order:
    OpenAI (if configured) → OCR (pytesseract) → size-aware placeholder.
  • generate_captions_for_chunks() now respects pre-extracted image_caption
    written by the ingestion pass and only falls back to OCR/placeholder for gaps.
  • Raw image_bytes are always popped from chunks in the finally block —
    they are never serialised into ChromaDB.
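
The bounding-box proximity idea behind extract_captions_from_pdf() can be illustrated without PyMuPDF. The sketch below is not the code from this PR: `Box`, `TextBlock`, `nearest_caption`, and the `max_gap` cutoff are hypothetical names and values, and the reading of "below (or above)" as "prefer a block below, otherwise fall back to one above" is an assumption. Coordinates follow PyMuPDF's page convention, with the origin at the top-left and y increasing downward.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


class TextBlock(NamedTuple):
    box: Box
    text: str


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the x-range shared by two boxes (0.0 if they do not overlap)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def nearest_caption(
    figure: Box,
    blocks: Sequence[TextBlock],
    max_gap: float = 40.0,  # assumed cutoff in points, not taken from the PR
) -> Optional[str]:
    """Return the text of the closest block below the figure, else above it."""
    best: Optional[Tuple[int, float, str]] = None
    for block in blocks:
        text = block.text.strip()
        if not text or horizontal_overlap(figure, block.box) <= 0:
            continue  # empty block, or not in the same column as the figure
        if block.box.y0 >= figure.y1:
            rank, gap = 0, block.box.y0 - figure.y1  # block sits below
        elif block.box.y1 <= figure.y0:
            rank, gap = 1, figure.y0 - block.box.y1  # block sits above
        else:
            continue  # block overlaps the figure vertically
        if gap > max_gap:
            continue
        candidate = (rank, gap, text)
        # Lower rank wins first (below beats above), then the smaller gap.
        if best is None or candidate < best:
            best = candidate
    return best[2] if best else None
```

With real pages, the text blocks would presumably come from `page.get_text("blocks")`, whose tuples start with the four bounding-box coordinates followed by the block text.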

backend/app/services/document_ingestion.py (surgical addition):

  • Adds a proximity caption pass between chunk_document() and store_chunks().
  • Calls extract_captions_from_pdf(), matches captions to image chunks by
    (page, figure_index) order, and writes image_caption + bbox before
    store_chunks() / generate_captions_for_chunks() runs.
  • Wrapped in try/except — a caption failure never blocks ingestion.
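
As a rough illustration of the matching step described above, the sketch below pairs captions with image chunks in (page, figure_index) order and swallows any failure. It is a minimal sketch under assumptions, not the code added to document_ingestion.py: `attach_proximity_captions` and the caption record keys `page`, `figure_index`, and `caption` are hypothetical, while `is_image`, `image_caption`, and `bbox` are the chunk fields named in this PR.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

Chunk = Dict[str, Any]


def attach_proximity_captions(
    chunks: List[Chunk],
    extract: Callable[[], List[Dict[str, Any]]],
) -> List[Chunk]:
    """Write image_caption and bbox onto image chunks; never raise."""
    try:
        # Group extracted captions per page, ordered by figure_index.
        by_page: Dict[Any, List[Dict[str, Any]]] = {}
        for cap in sorted(extract(), key=lambda c: (c["page"], c["figure_index"])):
            by_page.setdefault(cap["page"], []).append(cap)

        seen: Dict[Any, int] = {}  # image chunks already visited on each page
        for chunk in chunks:
            if not chunk.get("is_image"):
                continue
            page = chunk.get("page")
            position = seen.get(page, 0)
            seen[page] = position + 1
            candidates = by_page.get(page, [])
            # The n-th image chunk on a page takes the n-th caption on that page.
            if position < len(candidates) and candidates[position].get("caption"):
                chunk["image_caption"] = candidates[position]["caption"]
                chunk["bbox"] = candidates[position]["bbox"]
    except Exception:
        # A caption failure is logged and ingestion carries on.
        logger.warning("Proximity caption pass failed; continuing", exc_info=True)
    return chunks
```

Matching by position assumes the chunker and the caption extractor visit the figures on a page in the same order. If that does not hold, matching on bbox overlap would be a safer key.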

🗂️ Type of Change

  • ✨ New feature
  • 🔧 Refactor / code cleanup

🧪 How was this tested?

  • Ran the backend locally (uvicorn app.main:app --reload)
  • Uploaded a PDF with labelled figures; confirmed ChromaDB stored chunks
    with is_image=True and non-empty image_caption matching adjacent text
  • Verified images with no adjacent text receive the size-aware placeholder
  • Confirmed decorative images below 1,000 px² are skipped
  • Confirmed raw image_bytes are never present in stored chunk metadata
  • Confirmed existing text and table chunks are unaffected (no regression)
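
The placeholder, decorative-image, and image_bytes checks above could be automated against a small stand-in for the captioning pass. The sketch below is illustrative only: `caption_chunks`, `placeholder_caption`, the `width`/`height` chunk keys, and the placeholder wording are assumptions, and the real pipeline may apply the 1,000 px² threshold at a different stage. The `captioners` list stands in for the OpenAI and OCR steps, tried in order.

```python
from typing import Any, Callable, Dict, List, Optional

MIN_AREA_PX = 1_000  # threshold quoted in this PR; where it is applied is assumed

Chunk = Dict[str, Any]
Captioner = Callable[[bytes], Optional[str]]


def placeholder_caption(width: int, height: int) -> str:
    """Size-aware fallback text (the exact wording is an assumption)."""
    return f"[image {width}x{height} px, no caption available]"


def first_caption(image_bytes: bytes, captioners: List[Captioner]) -> Optional[str]:
    """Try each captioner in order; a failing one falls through to the next."""
    for captioner in captioners:
        try:
            caption = captioner(image_bytes)
        except Exception:
            continue
        if caption:
            return caption
    return None


def caption_chunks(chunks: List[Chunk], captioners: List[Captioner]) -> List[Chunk]:
    """Caption image chunks, drop decorative ones, and always strip raw bytes."""
    kept: List[Chunk] = []
    try:
        for chunk in chunks:
            if not chunk.get("is_image"):
                kept.append(chunk)  # text and table chunks pass through unchanged
                continue
            width, height = chunk.get("width", 0), chunk.get("height", 0)
            if width * height < MIN_AREA_PX:
                continue  # decorative image: skipped
            if not chunk.get("image_caption"):  # keep a pre-extracted caption
                chunk["image_caption"] = first_caption(
                    chunk.get("image_bytes", b""), captioners
                ) or placeholder_caption(width, height)
            kept.append(chunk)
    finally:
        # Raw bytes are removed even if captioning raised part-way through.
        for chunk in chunks:
            chunk.pop("image_bytes", None)
    return kept
```

Putting the `pop` in `finally` means no code path can hand raw bytes to the vector store, including the path where a captioner raises an unexpected error.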

✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@nancysangani requested a review from param20h as a code owner on June 6, 2026, 06:25
@nancysangani

Contributor Author

Hi @param20h, please review the PR when you get a chance. Thanks!

@nancysangani

Contributor Author

@param20h, please review this PR. All checks have passed and the conflicts are resolved. Thanks!

@param20h merged commit 0434eb0 into param20h:dev on Jun 11, 2026
6 checks passed
@github-actions (bot) added labels on Jun 11, 2026: gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), level:advanced (+55 pts), mentor:param20h (Mentor for this PR), type:backend (Backend API)
@github-actions

🎉 Congratulations on getting your Pull Request merged! 🎉

Thank you for contributing to PDF-Assistant-RAG as part of GSSoC '26! 🚀

Keep up the great work! ✨

Development

Successfully merging this pull request may close these issues.

feat(rag): Add support for extracting and indexing image captions from PDFs

2 participants