fix bugs, improve LLM Schemas and LLM Processors #964
Open
linminhtoo wants to merge 30 commits into datalab-to:master from
…ageCorrection: if block_correction_prompt is None, it should default to default_user_prompt. Previously, however, the processor would simply early-terminate without even trying to proceed, causing an essential step in page correction to be skipped!
I made these fixes and improvements to make marker even more powerful for my side project. So far I have tested on 100+ SEC filings (10-Q and 10-K) of various publicly listed companies. These are extremely long documents, anywhere from 30k to 100k+ tokens, with multiple tables. So far, I am very, very pleased with the results. All changes are described in detail below.
A key finding I made was that explicitly asking the LLM to think deeper during the LLM processing steps immensely improves the results. In particular, this virtually removed all errors from the LLMTableProcessor step, especially when the initial OCR pipeline made severe mistakes. I used a locally hosted (via vLLM) Qwen/Qwen3-VL-32B-Instruct-FP8 model for this. I initially used allenai/olmOCR-2-7B-1025-FP8, but it struggled a lot on all tasks, usually making no corrections and claiming that everything was correct. Along with these changes, I also improved the various schemas and output parsing. Finally, I fixed a number of bugs.
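The "think deeper" trick amounts to requiring an analysis field in the structured output before the actual corrections, so the model spends tokens reasoning first. A minimal sketch, with an assumed response shape (field names here are illustrative, not marker's exact schemas):

```python
import json

# Illustrative sketch (not marker's exact schema): placing an "analysis"
# field before the corrections forces the model to generate its reasoning
# tokens first, which in practice reduced table-correction errors.
ANALYSIS_INSTRUCTION = (
    "First write a deep analysis of the block in the `analysis` field, "
    "then emit your corrections. Respond with JSON: "
    '{"analysis": "...", "correction_needed": true/false, "corrected_html": "..."}'
)

def parse_llm_response(raw: str) -> dict:
    data = json.loads(raw)
    # The analysis field is only a scratchpad; downstream code keys off
    # correction_needed before touching corrected_html.
    if not data.get("correction_needed", False):
        data["corrected_html"] = None
    return data
```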
For my experiments, I ran this script https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.sh after downloading HTMLs using download.sh.
All in all, I've tried to keep changes backward compatible, preserving the existing marker behavior as much as possible; the new "bells and whistles" require users to explicitly enable them to take effect.

Please let me know if and how I should re-run any of the benchmarks or tests to prove that these changes are not detrimental. I noticed that there is no automatic pytest CI/CD hook; maybe we want to implement that soon.
CHANGELOG
Added
marker/telemetry.py:
- build_marker_trace_headers() to generate X-Marker-* headers (source file/path, processor, page, block, and extra context). These are extremely useful for tracing LLM calls, debugging intermediate steps, and monitoring per-step latency.
- sanitize_headers() to safely strip/clip header values.

marker/processors/llm/llm_utils.py:
- inject_analysis_prompt() with analysis_style support (summary/deep/auto).
- strip_code_fences() and string_indicates_no_corrections() helpers for more robust response parsing.

analysis_style options for LLM processors (opt-in "deeper thinking" prompts), used in:
- LLMTableProcessor (marker/processors/llm/llm_table.py)
- LLMPageCorrectionProcessor (marker/processors/llm/llm_page_correction.py)
- LLMSectionHeaderProcessor (marker/processors/llm/llm_sectionheader.py)
By asking the LLM to think through an "analysis" output field of the schema, we can drastically improve the accuracy of results.

LLMSectionHeaderProcessor chunking + additional context knobs (all default to "off" to preserve existing behavior):
- max_chunk_tokens, chunk_tokenizer_hf_model_id (best-effort HF-tokenizer-based token counting)
- neighbor_text_max_blocks, neighbor_text_max_chars (inject nearby text around each header)
- recent_headers_max_count (carry forward recently-fixed headers as anchoring context)
- max_rewrite_retries + a score field to optionally retry low-confidence chunks

Changed
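The telemetry helpers above might look roughly like this; signatures, header names, and the length cap are assumptions rather than marker/telemetry.py's exact API:

```python
# Hedged sketch of the trace-header helpers: approximations of
# marker/telemetry.py, not its exact API.
def build_marker_trace_headers(source_path=None, processor=None, page=None,
                               block=None, **extra):
    values = {
        "X-Marker-Source": source_path,
        "X-Marker-Processor": processor,
        "X-Marker-Page": page,
        "X-Marker-Block": block,
        **{f"X-Marker-{key}": value for key, value in extra.items()},
    }
    # Drop unset fields and stringify the rest for use as HTTP headers.
    return {k: str(v) for k, v in values.items() if v is not None}

def sanitize_headers(headers, max_len=200):
    # Strip non-printable characters and clip values so headers stay valid.
    return {k: "".join(ch for ch in str(v) if ch.isprintable())[:max_len]
            for k, v in headers.items()}
```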
- Schemas for LLMTableProcessor, LLMMathBlockProcessor, LLMFormProcessor, LLMPageCorrectionProcessor, and LLMSectionHeaderProcessor now include correction_needed: bool fields that can be quickly checked before further parsing.
- BaseService.__call__(..., extra_headers: Mapping[str, str] | None = None) in marker/services/__init__.py
- Services accepting extra_headers: OpenAIService, AzureOpenAIService, ClaudeService, BaseGeminiService (and GoogleVertexService), OllamaService
- OpenAIService and AzureOpenAIService pass/sanitize extra_headers and attach default request headers (X-Title, HTTP-Referer) plus X-Marker-Block when available
- LLMTableProcessor: additional trace context (TableChunk, RowStart/RowEnd, Iteration)
- table_image_expansion_ratio → image_expansion_ratio (note: config key rename)
- HTMLProvider stores source_filepath for easier debugging/tracing (marker/providers/html.py)

Fixed
- LLMPageCorrectionProcessor response handling is more robust (normalizes string/list/dict responses, handles code fences and "no corrections" phrases, infers correction type when missing).
- PageGroup.get_image() now consistently returns a PIL Image.Image even when the stored image is bytes/memoryview (marker/schema/groups/page.py)
- PromptData.schema typing corrected to type[BaseModel] (marker/processors/llm/__init__.py)
- Block now includes an optional html field (used by processors like LLMFormProcessor) (marker/schema/blocks/base.py)
- More robust response parsing in LLMTableProcessor, LLMMathBlockProcessor, and LLMFormProcessor (better schema adherence, code-fence stripping, "no corrections" detection)

Notes / Migration
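The more robust response normalization could be sketched like this (the phrase list and response shapes are assumptions, not marker's exact code):

```python
import json
import re

# Hedged sketch of the normalization described above: accept a raw string,
# a single dict, or a list of dicts, and coerce to a list of corrections.
NO_CORRECTION_PHRASES = ("no corrections", "no changes needed")

def strip_code_fences(text: str) -> str:
    """Remove a surrounding ```lang ... ``` fence, if present."""
    match = re.match(r"^```[\w-]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()

def normalize_corrections(response) -> list[dict]:
    if isinstance(response, str):
        text = strip_code_fences(response)
        if any(p in text.lower() for p in NO_CORRECTION_PHRASES):
            return []  # a plain "no corrections" answer means nothing to do
        response = json.loads(text)
    if isinstance(response, dict):
        response = [response]
    return [c for c in response if isinstance(c, dict)]
```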
- New extra_headers kwarg (added to BaseService.__call__).
- If you used LLMTableProcessor.table_image_expansion_ratio (this was most likely a typo bug that was never fixed), migrate to LLMTableProcessor.image_expansion_ratio.

Other notes
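A backward-compatible shim for the config-key rename could look like this (the helper name and the default value are illustrative, not marker's actual code):

```python
import warnings

# Illustrative migration shim for the rename noted above: prefer the new
# key, but honor the old one with a deprecation warning.
def resolve_image_expansion_ratio(config: dict, default: float = 0.05) -> float:
    if "table_image_expansion_ratio" in config:
        warnings.warn(
            "table_image_expansion_ratio is deprecated; "
            "use image_expansion_ratio instead",
            DeprecationWarning,
        )
        return config["table_image_expansion_ratio"]
    return config.get("image_expansion_ratio", default)
```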
- I implemented a custom OpenAIService wrapper at https://github.com/linminhtoo/finance-rag-assistant-v2/blob/mlin/v1.1/scripts/process_html_to_markdown.py#L46
- The marker retry loop actually overlaps with the internal OpenAI SDK's retry mechanism, causing excessive requests to the API. We may want to either set the API SDK's retries to 0 and use the marker retry mechanism, or vice versa.
- In my opinion, the most brittle step at the moment is the LLMSectionHeaderProcessor.
- Also, I started from #962 and removed the linting (ruff formatting) commits, which were the result of running pre-commit run --all-files. I suggest we have a separate, dedicated PR just to run this ruff formatting.
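The double-retry fan-out can be demonstrated with a toy stand-in for the SDK client (pure Python, not the real openai client, where retries are typically configured via the client's max_retries setting):

```python
# Toy model of the overlap: if both the SDK and marker retry, one failure
# fans out into (sdk_retries + 1) * (marker_retries + 1) API requests.
class FakeClient:
    def __init__(self, max_retries: int):
        self.max_retries = max_retries
        self.requests_sent = 0

    def call(self) -> str:
        for _ in range(self.max_retries + 1):
            self.requests_sent += 1  # every attempt hits the API
        raise RuntimeError("still failing")

def marker_retry(client: FakeClient, marker_retries: int = 2) -> None:
    # Stand-in for marker's own retry loop around the service call.
    for _ in range(marker_retries + 1):
        try:
            client.call()
            return
        except RuntimeError:
            continue
```

With SDK retries zeroed out, a persistent failure costs marker_retries + 1 requests instead of the multiplied total.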