Match document language in LLM image descriptions by Br1an67 · Pull Request #1000 · datalab-to/marker

Br1an67 · 2026-03-01T16:16:02Z

Summary

Fixes LLM-generated image descriptions that ignore the document language, e.g., producing English descriptions for French PDFs.

Two changes:

Added an explicit language-matching instruction (rule #4) to the image description prompt, telling the LLM to output descriptions in the same language as the input text.
When the image itself has little or no text (fewer than 20 characters), up to 200 characters of surrounding page text are included as language context, so the LLM can detect the document language even for pure diagrams.

Closes #954

Updated image_description_prompt in LLMImageDescriptionProcessor with language-matching instruction
Modified block_prompts() to include surrounding page text for language context when the image's own text is sparse
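For reviewers, the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper names `language_context` and `build_prompt`, the constant names, and the wording of `LANGUAGE_RULE` are all hypothetical, and only the two thresholds (20 and 200 characters) come from the description.

```python
# Hypothetical sketch of the language-context fallback; not marker's actual code.
MIN_IMAGE_TEXT_CHARS = 20  # threshold stated in the PR description
MAX_CONTEXT_CHARS = 200    # cap stated in the PR description

# Assumed wording; the real rule #4 text lives in image_description_prompt.
LANGUAGE_RULE = "Write the description in the same language as the input text."


def language_context(image_text: str, page_text: str) -> str:
    """Return a page-text snippet when the image's own text is too sparse."""
    if len(image_text.strip()) >= MIN_IMAGE_TEXT_CHARS:
        # The image carries enough text for the LLM to infer the language.
        return ""
    # Otherwise borrow a short snippet of the surrounding page text.
    return page_text.strip()[:MAX_CONTEXT_CHARS]


def build_prompt(base_prompt: str, image_text: str, page_text: str) -> str:
    """Assemble a prompt from the base text, the language rule, and any context."""
    parts = [base_prompt, LANGUAGE_RULE]
    if image_text.strip():
        parts.append("Image text:\n" + image_text.strip())
    context = language_context(image_text, page_text)
    if context:
        parts.append("Surrounding page text (language context):\n" + context)
    return "\n\n".join(parts)
```

With a text-free diagram on a French page, `build_prompt("Describe the image.", "", "Bonjour le monde ...")` would carry the French snippet into the prompt, giving the model something to infer the language from.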

Add language-matching instruction to image description prompt

499a487

Br1an67 mentioned this pull request Mar 1, 2026

[BUG] Image descriptions ignore document language when using use_llm=True #954

Open