fix bugs, improve LLM Schemas and LLM Processors#964

Open
linminhtoo wants to merge 30 commits into datalab-to:master from linminhtoo:master
Conversation

linminhtoo commented Dec 28, 2025

I made these fixes and improvements to make marker even more powerful for my side project. So far I have tested on 100+ SEC filings (10Q and 10K) of various publicly listed companies. These are extremely long documents with anywhere from 30k to 100k+ tokens with multiple tables. So far, I am very, very pleased with the results.

All changes are described in detail below.

A key finding was that explicitly asking the LLM to think more deeply during the LLM processing steps immensely improves the results. In particular, this virtually eliminated all errors from the LLMTableProcessor step, especially when the initial OCR pipeline made severe mistakes. I used a locally hosted (via vLLM) Qwen/Qwen3-VL-32B-Instruct-FP8 model for this. I initially used allenai/olmOCR-2-7B-1025-FP8, but it struggled a lot on all tasks, usually making no corrections and claiming that everything was correct. Alongside these changes, I also improved the various schemas and output parsing. Finally, I fixed a number of bugs.

For my experiments, I ran this script https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.sh after downloading HTMLs using download.sh.

All in all, I've tried to keep changes backward compatible, preserving the existing marker behavior as much as possible, with the new "bells and whistles" requiring users to explicitly enable them to take effect.

Please let me know if and how I should re-run any of the benchmarks or tests to prove that these changes are not detrimental. I noticed that there is no automatic pytest CI/CD hook, maybe we want to implement that soon.


CHANGELOG

Added

  • Tracing/telemetry helpers via marker/telemetry.py:
    • build_marker_trace_headers() to generate X-Marker-* headers (source file/path, processor, page, block, and extra context). These are extremely useful for tracing LLM calls to debug intermediate steps as well as monitor per-step latency.
    • sanitize_headers() to safely strip/clip header values
  • Shared LLM utilities in marker/processors/llm/llm_utils.py:
    • inject_analysis_prompt() with analysis_style support (summary / deep / auto)
    • strip_code_fences() and string_indicates_no_corrections() helpers for more robust response parsing
  • analysis_style options for LLM processors (opt-in “deeper thinking” prompts), used in:
    • LLMTableProcessor (marker/processors/llm/llm_table.py)
    • LLMPageCorrectionProcessor (marker/processors/llm/llm_page_correction.py)
    • LLMSectionHeaderProcessor (marker/processors/llm/llm_sectionheader.py)
    • By allowing the LLM to "reason" by outputting text tokens into the "analysis" output field of the schema, we can drastically improve accuracy of results.
  • LLMSectionHeaderProcessor chunking + additional context knobs (all default to “off” to preserve existing behavior):
    • max_chunk_tokens, chunk_tokenizer_hf_model_id (best-effort HF tokenizer based token counting)
    • neighbor_text_max_blocks, neighbor_text_max_chars (inject nearby text around each header)
    • recent_headers_max_count (carry forward recently-fixed headers as anchoring context)
    • max_rewrite_retries + a score field to optionally retry low-confidence chunks
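
Roughly, the response-parsing helpers look like this (a simplified sketch, not the exact implementation; the fence regex and "no corrections" phrase list below are illustrative):

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a wrapping Markdown code fence (e.g. ```json ... ```) if present."""
    text = text.strip()
    match = re.match(r"^```[a-zA-Z0-9_-]*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text

# Illustrative phrase list; the real helper may check more variants.
_NO_CORRECTION_PHRASES = (
    "no corrections",
    "no changes needed",
    "no correction needed",
    "everything is correct",
)

def string_indicates_no_corrections(text: str) -> bool:
    """Heuristic check for a 'nothing to fix' response before attempting JSON parsing."""
    lowered = text.strip().lower()
    return any(phrase in lowered for phrase in _NO_CORRECTION_PHRASES)
```

This lets every LLM processor tolerate models that wrap JSON in fences or reply with free text instead of the requested schema.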

Changed

  • LLM prompts/schemas updated to prefer structured JSON outputs (with legacy fallbacks still handled in many places):
    • LLMTableProcessor, LLMMathBlockProcessor, LLMFormProcessor, LLMPageCorrectionProcessor, LLMSectionHeaderProcessor
    • In particular, added explicit correction_needed: bool fields that can be quickly checked before further parsing.
  • Service interface extended to support tracing headers:
    • BaseService.__call__(..., extra_headers: Mapping[str, str] | None = None) in marker/services/__init__.py
    • All concrete services now accept extra_headers: OpenAIService, AzureOpenAIService, ClaudeService, BaseGeminiService (and GoogleVertexService), OllamaService
    • OpenAIService and AzureOpenAIService pass/sanitize extra_headers and attach default request headers (X-Title, HTTP-Referer) plus X-Marker-Block when available
  • LLMTableProcessor:
    • Adds table chunk context to trace headers (TableChunk, RowStart/RowEnd, Iteration)
    • Renames table_image_expansion_ratio → image_expansion_ratio (note: config key rename)
  • HTMLProvider stores source_filepath for easier debugging/tracing (marker/providers/html.py)
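
To illustrate the header flow: a service merges its default request headers with the per-call extra_headers, letting trace headers win. This is a simplified sketch, not the actual marker code:

```python
from __future__ import annotations
from typing import Mapping

class SketchService:
    """Illustrative only: how a service might merge per-call extra headers."""

    DEFAULT_HEADERS = {
        "X-Title": "marker",
        "HTTP-Referer": "https://github.com/datalab-to/marker",
    }

    def build_headers(self, extra_headers: Mapping[str, str] | None = None) -> dict[str, str]:
        headers = dict(self.DEFAULT_HEADERS)
        if extra_headers:
            # Per-call trace headers (X-Marker-*) override defaults on collision.
            headers.update(extra_headers)
        return headers
```

Custom services only need to accept the kwarg; forwarding the headers to the underlying client is optional.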

Fixed

  • LLMPageCorrectionProcessor response handling is more robust (normalizes string/list/dict responses, handles code fences and “no corrections” phrases, infers correction type when missing).
    • Also fixed a bug where corrections could be wrongly skipped/mishandled just because the user did not specify their own prompt. The logic now correctly falls back to the default prompt.
  • PageGroup.get_image() now consistently returns a PIL Image.Image even when the stored image is bytes/memoryview (marker/schema/groups/page.py)
  • PromptData.schema typing corrected to type[BaseModel] (marker/processors/llm/__init__.py)
  • Block now includes an optional html field (used by processors like LLMFormProcessor) (marker/schema/blocks/base.py)
  • Minor prompt/parse correctness improvements across LLMTableProcessor, LLMMathBlockProcessor, and LLMFormProcessor (better schema adherence, code-fence stripping, “no corrections” detection)
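
As a sketch of the quick-check pattern (simplified; function name and fallback behavior here are illustrative, only the correction_needed field comes from this PR):

```python
import json

def needs_correction(raw: str) -> bool:
    """Check the structured correction_needed flag before any deeper parsing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Legacy free-text response; let the older parsing path decide.
        return False
    if isinstance(data, list) and data:
        data = data[0]  # some models wrap the object in a single-element list
    if not isinstance(data, dict):
        return False
    # If the flag is missing, assume corrections may be present and keep parsing.
    return bool(data.get("correction_needed", True))
```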

Notes / Migration

  • Custom service implementations must accept the new extra_headers kwarg (added to BaseService.__call__).
  • If you previously configured LLMTableProcessor.table_image_expansion_ratio (the old key name was most likely a typo that had never been fixed), migrate to LLMTableProcessor.image_expansion_ratio.
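
If you want old configs to keep working during migration, a small shim along these lines can map the renamed key (illustrative only, not part of this PR):

```python
import warnings

def migrate_table_processor_config(config: dict) -> dict:
    """Map the old (typo'd) key to the new one, preferring an explicit new key."""
    config = dict(config)  # do not mutate the caller's dict
    if "table_image_expansion_ratio" in config:
        warnings.warn(
            "table_image_expansion_ratio is deprecated; use image_expansion_ratio",
            DeprecationWarning,
        )
        old_value = config.pop("table_image_expansion_ratio")
        # Only adopt the old value if the new key was not set explicitly.
        config.setdefault("image_expansion_ratio", old_value)
    return config
```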

Other notes

I implemented a custom OpenAIService wrapper at https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.py#L46

  • There are some features we may want to bring in. For example, I have a knob to cap the max long side of images sent to the LLM. Without it, depending on the PDF, very large images could be sent to the LLM, producing thousands of image tokens and greatly slowing down requests. In my use case, limiting the max long side to ~1200 speeds up requests without sacrificing accuracy.
  • The concept of "max retries" was a little confusing at first. The marker retry loop overlaps with the OpenAI SDK's internal retry mechanism, causing excessive requests to the API. We may want to either set the SDK's retries to 0 and rely on marker's retry mechanism, or vice versa.
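
The long-side cap itself is just aspect-ratio-preserving arithmetic; a simplified version of the knob in my wrapper (the function name and the 1200 default are illustrative):

```python
def clamp_long_side(width: int, height: int, max_long_side: int = 1200) -> tuple[int, int]:
    """Return a size whose longer side is at most max_long_side, preserving aspect ratio."""
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height  # already small enough; no resize needed
    scale = max_long_side / long_side
    # Round and floor at 1px so degenerate aspect ratios stay valid image sizes.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting size can then be passed to the image library's resize before encoding the image for the LLM request.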

In my opinion, the most brittle step at the moment is the LLMSectionHeaderProcessor.

  • I wanted to make sure this step gives good results as it would affect downstream document hierarchy parsing and hierarchical chunking, where chunks for RAG are constructed based on relative section levels. An accurate hierarchy would allow for more intricate retrieval patterns during RAG - for example, fetching section-neighbor chunks, parent chunks or child chunks, which have been shown to boost RAG accuracy.
  • For starters, I found that my LLM calls at this step had upwards of 15k-40k tokens, which I felt made the task too difficult for the LLM. I experimented with breaking the input into chunks of ~10k tokens (this limit includes all the additional text context, so the actual number of headers in each chunk is significantly reduced). I also injected additional context around each header, namely line-height information and nearby text from the document. Finally, I included recently corrected headers from previous chunks in the prompt.
  • All these improve the results a bit, but there are still occasional inconsistencies where heading levels jump around or are not consistent across multiple pages. My main suspicion is that the LLM does not have enough "anchoring" information to decide the truly appropriate header levels, especially when there are "inconsistencies" in the OCR/recognition results.
  • Even so, the current processor does work reasonably well and is a great baseline.
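
The chunking itself is a simple greedy split under a token budget. A simplified sketch (the real version can count tokens with an HF tokenizer selected via chunk_tokenizer_hf_model_id; the whitespace fallback below is illustrative):

```python
def chunk_by_token_budget(items, max_tokens=10_000, count_tokens=None):
    """Greedily pack items into chunks whose total token count stays under max_tokens."""
    if count_tokens is None:
        # Best-effort fallback when no tokenizer is available: whitespace word count.
        count_tokens = lambda text: len(text.split())
    chunks, current, current_tokens = [], [], 0
    for item in items:
        tokens = count_tokens(item)
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)  # close the current chunk before it overflows
            current, current_tokens = [], 0
        current.append(item)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

Carrying recently corrected headers forward between chunks then just means prepending the tail of the previous chunk's output to the next chunk's prompt.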

Also, I started from #962 and removed linting (ruff formatting) commits which were a result of running pre-commit run --all. I suggest we have a separate dedicated PR just to run this ruff formatting.

…ageCorrection

If block_correction_prompt is None, it should default to default_user_prompt. Previously, however, the processor would simply terminate early without even trying to proceed, causing an essential page-correction step to be skipped!