fix bugs, improve LLM Schemas and LLM Processors#964

Open
linminhtoo wants to merge 30 commits into datalab-to:master from linminhtoo:master
Conversation

linminhtoo commented Dec 28, 2025

I made these fixes and improvements to make marker even more powerful for my side project. So far I have tested on 100+ SEC filings (10Q and 10K) of various publicly listed companies. These are extremely long documents with anywhere from 30k to 100k+ tokens with multiple tables. So far, I am very, very pleased with the results.

All changes are described in detail below.

A key finding was that explicitly asking the LLM to think more deeply during the LLM processing steps immensely improves the results. In particular, this virtually eliminated all errors from the LLMTableProcessor step, especially when the initial OCR pipeline made severe mistakes. I used a locally hosted (via vLLM) Qwen/Qwen3-VL-32B-Instruct-FP8 model for this. I initially used allenai/olmOCR-2-7B-1025-FP8, but it struggled a lot on all tasks, usually making no corrections and claiming that everything was correct. Alongside these changes, I also improved the various schemas and output parsing. Finally, I fixed a number of bugs.

For my experiments, I ran this script https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.sh after downloading HTMLs using download.sh.

All in all, I've tried to keep changes backward compatible, preserving the existing marker behavior as much as possible, with the new "bells and whistles" requiring users to explicitly enable them to take effect.

Please let me know if and how I should re-run any of the benchmarks or tests to prove that these changes are not detrimental. I noticed that there is no automatic pytest CI/CD hook, maybe we want to implement that soon.


CHANGELOG

Added

  • Tracing/telemetry helpers via marker/telemetry.py:
    • build_marker_trace_headers() to generate X-Marker-* headers (source file/path, processor, page, block, and extra context). These are extremely useful for tracing LLM calls to debug intermediate steps as well as monitor per-step latency.
    • sanitize_headers() to safely strip/clip header values
  • Shared LLM utilities in marker/processors/llm/llm_utils.py:
    • inject_analysis_prompt() with analysis_style support (summary / deep / auto)
    • strip_code_fences() and string_indicates_no_corrections() helpers for more robust response parsing
  • analysis_style options for LLM processors (opt-in “deeper thinking” prompts), used in:
    • LLMTableProcessor (marker/processors/llm/llm_table.py)
    • LLMPageCorrectionProcessor (marker/processors/llm/llm_page_correction.py)
    • LLMSectionHeaderProcessor (marker/processors/llm/llm_sectionheader.py)
    • By allowing the LLM to "reason" by outputting text tokens into the "analysis" output field of the schema, we can drastically improve accuracy of results.
  • LLMSectionHeaderProcessor chunking + additional context knobs (all default to “off” to preserve existing behavior):
    • max_chunk_tokens, chunk_tokenizer_hf_model_id (best-effort HF tokenizer based token counting)
    • neighbor_text_max_blocks, neighbor_text_max_chars (inject nearby text around each header)
    • recent_headers_max_count (carry forward recently-fixed headers as anchoring context)
    • max_rewrite_retries + a score field to optionally retry low-confidence chunks
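
Roughly, the response-parsing helpers look like this (a simplified sketch, not the exact implementation; the fence regex and "no corrections" phrase list below are illustrative):

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a wrapping Markdown code fence (e.g. ```json ... ```) if present."""
    text = text.strip()
    match = re.match(r"^```[a-zA-Z0-9_-]*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text

# Illustrative phrase list; the real helper may check more variants.
_NO_CORRECTION_PHRASES = (
    "no corrections",
    "no changes needed",
    "no correction needed",
    "everything is correct",
)

def string_indicates_no_corrections(text: str) -> bool:
    """Heuristic check for a 'nothing to fix' response before attempting JSON parsing."""
    lowered = text.strip().lower()
    return any(phrase in lowered for phrase in _NO_CORRECTION_PHRASES)
```

This lets every LLM processor tolerate models that wrap JSON in fences or reply with free text instead of the requested schema.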

Changed

  • LLM prompts/schemas updated to prefer structured JSON outputs (with legacy fallbacks still handled in many places):
    • LLMTableProcessor, LLMMathBlockProcessor, LLMFormProcessor, LLMPageCorrectionProcessor, LLMSectionHeaderProcessor
    • In particular, added explicit correction_needed: bool fields that can be quickly checked before further parsing.
  • Service interface extended to support tracing headers:
    • BaseService.__call__(..., extra_headers: Mapping[str, str] | None = None) in marker/services/__init__.py
    • All concrete services now accept extra_headers: OpenAIService, AzureOpenAIService, ClaudeService, BaseGeminiService (and GoogleVertexService), OllamaService
    • OpenAIService and AzureOpenAIService pass/sanitize extra_headers and attach default request headers (X-Title, HTTP-Referer) plus X-Marker-Block when available
  • LLMTableProcessor:
    • Adds table chunk context to trace headers (TableChunk, RowStart/RowEnd, Iteration)
    • Renames table_image_expansion_ratio → image_expansion_ratio (note: config key rename)
  • HTMLProvider stores source_filepath for easier debugging/tracing (marker/providers/html.py)
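
To illustrate the header flow: a service merges its default request headers with the per-call extra_headers, letting trace headers win. This is a simplified sketch, not the actual marker code:

```python
from __future__ import annotations
from typing import Mapping

class SketchService:
    """Illustrative only: how a service might merge per-call extra headers."""

    DEFAULT_HEADERS = {
        "X-Title": "marker",
        "HTTP-Referer": "https://github.com/datalab-to/marker",
    }

    def build_headers(self, extra_headers: Mapping[str, str] | None = None) -> dict[str, str]:
        headers = dict(self.DEFAULT_HEADERS)
        if extra_headers:
            # Per-call trace headers (X-Marker-*) override defaults on collision.
            headers.update(extra_headers)
        return headers
```

Custom services only need to accept the kwarg; forwarding the headers to the underlying client is optional.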

Fixed

  • LLMPageCorrectionProcessor response handling is more robust (normalizes string/list/dict responses, handles code fences and “no corrections” phrases, infers correction type when missing).
    • Also fixed a bug where corrections could be wrongly skipped/mishandled just because the user did not specify their own prompt. The logic now correctly falls back to the default prompt.
  • PageGroup.get_image() now consistently returns a PIL Image.Image even when the stored image is bytes/memoryview (marker/schema/groups/page.py)
  • PromptData.schema typing corrected to type[BaseModel] (marker/processors/llm/__init__.py)
  • Block now includes an optional html field (used by processors like LLMFormProcessor) (marker/schema/blocks/base.py)
  • Minor prompt/parse correctness improvements across LLMTableProcessor, LLMMathBlockProcessor, and LLMFormProcessor (better schema adherence, code-fence stripping, “no corrections” detection)
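
As a sketch of the quick-check pattern (simplified; function name and fallback behavior here are illustrative, only the correction_needed field comes from this PR):

```python
import json

def needs_correction(raw: str) -> bool:
    """Check the structured correction_needed flag before any deeper parsing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Legacy free-text response; let the older parsing path decide.
        return False
    if isinstance(data, list) and data:
        data = data[0]  # some models wrap the object in a single-element list
    if not isinstance(data, dict):
        return False
    # If the flag is missing, assume corrections may be present and keep parsing.
    return bool(data.get("correction_needed", True))
```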

Notes / Migration

  • Custom service implementations must accept the new extra_headers kwarg (added to BaseService.__call__).
  • If you previously configured LLMTableProcessor.table_image_expansion_ratio (the old key name was most likely a typo that had never been fixed), migrate to LLMTableProcessor.image_expansion_ratio.
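
If you want old configs to keep working during migration, a small shim along these lines can map the renamed key (illustrative only, not part of this PR):

```python
import warnings

def migrate_table_processor_config(config: dict) -> dict:
    """Map the old (typo'd) key to the new one, preferring an explicit new key."""
    config = dict(config)  # do not mutate the caller's dict
    if "table_image_expansion_ratio" in config:
        warnings.warn(
            "table_image_expansion_ratio is deprecated; use image_expansion_ratio",
            DeprecationWarning,
        )
        old_value = config.pop("table_image_expansion_ratio")
        # Only adopt the old value if the new key was not set explicitly.
        config.setdefault("image_expansion_ratio", old_value)
    return config
```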

Other notes

I implemented a custom OpenAIService wrapper at https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.py#L46

  • There are some features we may want to bring in. For example, I have a knob to cap the max long side of images sent to the LLM. Without it, depending on the PDF, very large images could be sent to the LLM, producing thousands of image tokens and greatly slowing down requests. In my use case, limiting the max long side to ~1200 speeds up requests without sacrificing accuracy.
  • The concept of "max retries" was a little confusing at first. The marker retry loop overlaps with the OpenAI SDK's internal retry mechanism, causing excessive requests to the API. We may want to either set the SDK's retries to 0 and rely on marker's retry mechanism, or vice versa.
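
The long-side cap itself is just aspect-ratio-preserving arithmetic; a simplified version of the knob in my wrapper (the function name and the 1200 default are illustrative):

```python
def clamp_long_side(width: int, height: int, max_long_side: int = 1200) -> tuple[int, int]:
    """Return a size whose longer side is at most max_long_side, preserving aspect ratio."""
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height  # already small enough; no resize needed
    scale = max_long_side / long_side
    # Round and floor at 1px so degenerate aspect ratios stay valid image sizes.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting size can then be passed to the image library's resize before encoding the image for the LLM request.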

In my opinion, the most brittle step at the moment is the LLMSectionHeaderProcessor.

  • I wanted to make sure this step gives good results as it would affect downstream document hierarchy parsing and hierarchical chunking, where chunks for RAG are constructed based on relative section levels. An accurate hierarchy would allow for more intricate retrieval patterns during RAG - for example, fetching section-neighbor chunks, parent chunks or child chunks, which have been shown to boost RAG accuracy.
  • For starters, I found that my LLM calls at this step had upwards of 15k-40k tokens, which I felt made the task too difficult for the LLM. I experimented with breaking the input into chunks of ~10k tokens (this limit includes all the additional text context, so the actual number of headers in each chunk is significantly reduced). I also injected additional context around each header, namely line-height information and nearby text from the document. Finally, I included recently corrected headers from previous chunks in the prompt.
  • All these improve the results a bit, but there are still occasional inconsistencies where heading levels jump around or are not consistent across multiple pages. My main suspicion is that the LLM does not have enough "anchoring" information to decide the truly appropriate header levels, especially when there are "inconsistencies" in the OCR/recognition results.
  • Even so, the current processor does work reasonably well and is a great baseline.
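
The chunking itself is a simple greedy split under a token budget. A simplified sketch (the real version can count tokens with an HF tokenizer selected via chunk_tokenizer_hf_model_id; the whitespace fallback below is illustrative):

```python
def chunk_by_token_budget(items, max_tokens=10_000, count_tokens=None):
    """Greedily pack items into chunks whose total token count stays under max_tokens."""
    if count_tokens is None:
        # Best-effort fallback when no tokenizer is available: whitespace word count.
        count_tokens = lambda text: len(text.split())
    chunks, current, current_tokens = [], [], 0
    for item in items:
        tokens = count_tokens(item)
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)  # close the current chunk before it overflows
            current, current_tokens = [], 0
        current.append(item)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

Carrying recently corrected headers forward between chunks then just means prepending the tail of the previous chunk's output to the next chunk's prompt.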

Also, I started from #962 and removed linting (ruff formatting) commits which were a result of running pre-commit run --all. I suggest we have a separate dedicated PR just to run this ruff formatting.

…ageCorrection

If block_correction_prompt is None, it should default to default_user_prompt. Previously, however, the processor would simply terminate early without even trying to proceed, causing an essential page-correction step to be skipped!