Deduplicate image description by moving it to img alt text #999

Open
Br1an67 wants to merge 1 commit into datalab-to:master from Br1an67:fix/issue-955-deduplicate-image-desc

Br1an67 commented Mar 1, 2026

Summary

Fix image descriptions appearing twice in output when use_llm=True — once as a separate text paragraph and once embedded with the image.

When the LLM generates an image description, Picture/Figure.assemble_html() creates a <p role='img'> paragraph with the description. When images are also being extracted, this description paragraph ends up in the output alongside the <img> tag, causing duplication.

The fix extracts the description from the <p role='img'> tag, moves it into the <img> tag's alt attribute, and removes the separate paragraph. The output then carries the description once, as the alt text of the <img> tag, instead of the duplicated:

Image ... description: Diagram showing battery modules connected in parallel.

Closes #955

Changes

  • Modified HTMLRenderer.extract_html() to extract description from p[role='img'] and use it as the <img> alt text
  • Description paragraph is removed from content to prevent duplication
  • Alt text is properly HTML-escaped
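
The changes above can be illustrated with a minimal, standalone sketch. This is not the code from this PR or the actual `HTMLRenderer.extract_html()` implementation: the function name `move_description_to_alt`, the `img_src` parameter, and the regex-based parsing are all assumptions made for illustration, using only the Python standard library.

```python
import html
import re

# Hypothetical pattern for the description paragraph, e.g. <p role='img'>...</p>.
DESC_RE = re.compile(
    r"""<p\s+role=(["'])img\1[^>]*>(.*?)</p>""",
    re.DOTALL | re.IGNORECASE,
)


def move_description_to_alt(block_html: str, img_src: str) -> tuple[str, str]:
    """Return (img_tag, remaining_html).

    The first <p role='img'> paragraph, if present, is removed from
    block_html and its text becomes the alt attribute of the <img> tag.
    """
    description = ""
    match = DESC_RE.search(block_html)
    if match:
        # Drop nested tags and decode entities so the text is escaped exactly once below.
        inner_text = re.sub(r"<[^>]+>", "", match.group(2))
        description = html.unescape(inner_text).strip()
        # Remove the description paragraph to prevent duplication.
        block_html = block_html[:match.start()] + block_html[match.end():]

    # Escape for use inside a double-quoted attribute value.
    alt = html.escape(description, quote=True)
    src = html.escape(img_src, quote=True)
    return f'<img src="{src}" alt="{alt}">', block_html.strip()
```

Calling `move_description_to_alt("<p role='img'>Diagram showing battery modules connected in parallel.</p>", "figure.png")` (with `figure.png` as a placeholder file name) would yield a single `<img>` tag with the description as its alt text and an empty remainder. A real implementation would more likely work on a parsed HTML tree than on a regular expression.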

Development

Successfully merging this pull request may close these issues.

[BUG: Output] Image descriptions appear twice in output (alt text + separate line)
