Match document language in LLM image descriptions #1000

Open
Br1an67 wants to merge 1 commit into datalab-to:master from Br1an67:fix/issue-954-image-desc-language

@Br1an67 Br1an67 commented Mar 1, 2026

Summary

Fix LLM-generated image descriptions ignoring the document language, e.g., producing English descriptions for French PDFs.

Two changes:

  1. Added explicit language-matching instruction (rule #4) to the image description prompt, telling the LLM to output descriptions in the same language as the input text
  2. When the image itself has little/no text (< 20 chars), surrounding page text (up to 200 chars) is included as language context, so the LLM can detect the document language even for pure diagrams
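The fallback in change 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`language_context`, `build_prompt`), the constant names, and the wording of `LANGUAGE_RULE` are hypothetical, and only the 20-character and 200-character thresholds come from the description above.

```python
# Hypothetical sketch of the language-context fallback; names are illustrative,
# not taken from LLMImageDescriptionProcessor.

MIN_IMAGE_TEXT_CHARS = 20   # below this, the image's own text is considered sparse
MAX_CONTEXT_CHARS = 200     # cap on surrounding page text used as language context

# Assumed wording; the actual prompt rule in the PR may differ.
LANGUAGE_RULE = "Write the description in the same language as the input text."


def language_context(image_text: str, page_text: str) -> str:
    """Return the text the LLM can use to infer the document language."""
    image_text = image_text.strip()
    if len(image_text) >= MIN_IMAGE_TEXT_CHARS:
        # The image carries enough text of its own.
        return image_text
    # Sparse or empty image text: fall back to a capped slice of page text.
    return page_text.strip()[:MAX_CONTEXT_CHARS]


def build_prompt(image_text: str, page_text: str) -> str:
    """Combine the language rule with the chosen context text."""
    context = language_context(image_text, page_text)
    return f"{LANGUAGE_RULE}\n\nInput text:\n{context}"
```

With this sketch, a pure diagram on a French page would reach the LLM with up to 200 characters of French page text, which gives the model something to detect the language from.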

Closes #954

Changes

  • Updated image_description_prompt in LLMImageDescriptionProcessor with language-matching instruction
  • Modified block_prompts() to include surrounding page text for language context when the image's own text is sparse

Development

Successfully merging this pull request may close these issues.

[BUG] Image descriptions ignore document language when using use_llm=True