feat(research): OCR-IQA correlation analysis, tier1 expansion, and SmartDoc-QA study #36
williaby wants to merge 2 commits into
…d SmartDoc-QA study

- FCA and ensemble-spread analysis scripts + reports (ocr_iqa_correlation)
- expand CER/WER analysis and tests; update OCR-IQA correlation doc and paper
- stream3 VLM consensus + tier1 build improvements; tier1 LoRA training script
- SmartDoc-QA OCR analysis pipeline (scripts, data metas) and dataset doc
- tier1 expansion report

Large generated outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl) left untracked; they are regenerable from the committed scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Scanned files: none
Pull request overview
This PR adds research pipelines, reports, and documentation for OCR-IQA correlation expansion, Tier 1 VLM consensus labeling, and SmartDoc-QA OCR analysis.
Changes:
- Adds FCA and ensemble-spread OCR-IQA analysis scripts, reports, tests, and paper updates.
- Expands Tier 1 Stream 3 VLM consensus labeling and adds Tier 1 training/report documentation.
- Adds SmartDoc-QA metadata generation, DeQA inference, OCR error-rate parsing, and correlation analysis.
Reviewed changes
Copilot reviewed 18 out of 24 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| .gitignore | Ignores regenerable research JSONL outputs. |
| DeQA-Score/scripts/train_tier1_lora.sh | Adds Tier 1 LoRA training script. |
| DeQA-Score/src/expansion/build_tier1.py | Integrates default Stream 3 source generation. |
| DeQA-Score/src/expansion/stream3_vlm_consensus.py | Reworks VLM consensus labeling with defaults, parallelism, and checkpointing. |
| docs/datasets/ocr-iqa-correlation.md | Updates OCR-IQA dataset status, engines, metrics, and results. |
| docs/datasets/smartdoc-qa.md | Adds SmartDoc-QA dataset documentation. |
| docs/tier1-expansion-report.md | Adds Tier 1 expansion report and training configuration. |
| research/ocr_iqa_correlation/analysis/cer_wer.py | Adds FCA computation and optional metric inclusion. |
| research/ocr_iqa_correlation/outputs/ensemble_spread_report.json | Adds generated ensemble/spread report. |
| research/ocr_iqa_correlation/outputs/fca_analysis_report.json | Adds generated FCA analysis report. |
| research/ocr_iqa_correlation/scripts/07_fca_analysis.py | Adds FCA recomputation and correlation script. |
| research/ocr_iqa_correlation/scripts/08_ensemble_spread_analysis.py | Adds multi-engine ensemble and spread analysis script. |
| research/ocr_iqa_correlation/tests/test_cer_wer.py | Adds FCA unit tests and compute_metrics coverage. |
| research/papers/06_ocr_iqa_correlation/paper.md | Expands paper with ensemble, VLM, and FCA findings. |
| research/smartdoc_qa_ocr_analysis/01_build_meta_json.py | Adds SmartDoc-QA DeQA manifest generation. |
| research/smartdoc_qa_ocr_analysis/02_run_deqa.sh | Adds SmartDoc-QA DeQA inference runner. |
| research/smartdoc_qa_ocr_analysis/03_compute_correlation.py | Adds SmartDoc-QA MOS/OCR correlation analysis. |
| research/smartdoc_qa_ocr_analysis/compute_error_rates.py | Adds SmartDoc-QA UNLV-ISRI accuracy parsing. |
| research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_04.json | Adds generated SmartDoc-QA metadata chunk. |
    total_items=total,
    errors=errors,
    accuracy_pct=acc_pct,
    error_rate=1.0 - acc_pct / 100.0,

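The conversion in the excerpt above can be checked with a small worked example. This is only a sketch: it assumes the accuracy percentage is defined as 100 * (total - errors) / total, and the helper names are hypothetical, not taken from compute_error_rates.py. Under that assumption the accuracy goes negative when errors exceed the item count, so the derived error rate can exceed 1.0.

```python
def accuracy_pct_from_counts(total_items: int, errors: int) -> float:
    """Assumed definition: accuracy = 100 * (total - errors) / total."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * (total_items - errors) / total_items


def error_rate_from_counts(total_items: int, errors: int) -> float:
    """Error rate computed directly from the raw counts."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return errors / total_items


def error_rate_from_accuracy(accuracy_pct: float) -> float:
    """Same arithmetic as the excerpt: 1.0 - accuracy_pct / 100.0."""
    return 1.0 - accuracy_pct / 100.0
```

With 200 items and 30 errors both routes give an error rate of about 0.15; with 250 errors on 200 items the accuracy is -25% and the error rate is 1.25, so downstream code that expects values in [0, 1] would need to decide whether to clamp.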
    result = label_single_image(image_path, api_key)
    _append_checkpoint(checkpoint_path, result, source.name)

    # Copy image to output if successful
    if result.consensus_mos is not None and not result.excluded:
        import shutil

        images_out = output_dir / "images" / f"stream3_{source.name}"
        images_out.mkdir(parents=True, exist_ok=True)
        out_filename = f"{source.name}_{result.image_id}{image_path.suffix}"
        out_path = images_out / out_filename
        if not out_path.exists():
            shutil.copyfile(image_path, out_path)

    completed_ids = _load_checkpoint(checkpoint_path) if resume else set()

    # Build work queue (skip already checkpointed)
    work_items: list[tuple[SourceDataset, Path]] = []
    for source, images in all_work:
        pending = [p for p in images if p.stem not in completed_ids]

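A minimal sketch of the resume step above, under stated assumptions: the checkpoint is a JSONL file whose records carry `source` and `image_id` fields (suggested by the `_append_checkpoint(checkpoint_path, result, source.name)` call, but not confirmed by the excerpt), and both helper names are hypothetical. Unlike the excerpt, which filters on `p.stem` alone, this variant keys completed work by source and stem so that identically named files in two datasets do not mask each other.

```python
import json
from pathlib import Path


def load_completed_ids(checkpoint_path: Path) -> set[str]:
    """Read a JSONL checkpoint and return 'source/image_id' keys."""
    completed: set[str] = set()
    if not checkpoint_path.exists():
        return completed
    with checkpoint_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
                key = f"{record['source']}/{record['image_id']}"
            except (json.JSONDecodeError, KeyError, TypeError):
                # Tolerate a partially written or malformed line,
                # e.g. after an interrupted run.
                continue
            completed.add(key)
    return completed


def pending_images(source_name: str, images: list[Path],
                   completed: set[str]) -> list[Path]:
    """Keep only images whose source-qualified key is not checkpointed."""
    return [p for p in images if f"{source_name}/{p.stem}" not in completed]
```

Skipping unreadable lines means an image whose record was cut off mid-write is simply labeled again on the next run, which costs one extra API call but never drops work.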
    OHR-Bench     700  (train only - val/test reserved)
    Tobacco800    500  (all - no reserved splits)
    SmartDoc-QA   500  (train only - val/test reserved; filename-based split)
    RealDAE       400  (all - no reserved splits; _in images only)
    FUNSD+        300  (train only - test reserved; filename-based split)
    OCR-Quality   200  (all - no reserved splits)
    SROIE         100  (train only - test reserved)
    Total: 2,700 samples

        )
        for s in config.stream3_sources
    ]
    from .stream3_vlm_consensus import TIER1_SOURCES, SourceDataset, generate_stream3

| Spread (std CER) vs. MOS | +0.278 | < 10^-22 |
| Spread (std CER) vs. mean CER | -0.474 | < 10^-67 |

The positive SRCC (+0.278) between spread and MOS means that spread tends to rise with MOS: engines disagree slightly more, not less, on higher-rated images. The correlation is weak, suggesting limited utility as a standalone quality signal. The stronger negative correlation between spread and mean CER (-0.474) points the same way: spread shrinks as mean CER grows, so it partially tracks overall error level rather than providing orthogonal information.
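The two quantities in the table can be sketched in plain Python. This is an illustration only: it assumes spread is the population standard deviation of per-engine CER for one image (the paper may use the sample form) and computes SRCC as the Pearson correlation of average ranks. The function names are hypothetical and are not taken from 08_ensemble_spread_analysis.py.

```python
from statistics import mean, pstdev


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (SRCC) via Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cer_spread(per_engine_cer: list[float]) -> float:
    """Assumed spread definition: population std of CER across engines."""
    return pstdev(per_engine_cer)
```

Calling `spearman(spreads, mos_scores)` over per-image lists yields a value in [-1, 1]; a positive value means images ranked higher on spread also tend to rank higher on MOS.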
- **MOS scale compression**: DeQA MOS ranges only 2.94-3.35 across tiers (narrow dynamic range)
- **High baseline CER**: All engines show CER > 0.28 even on original images, likely due to form-specific OCR challenges (checkboxes, tables, handwriting)
- **DeepSeek-OCR2 hallucination**: CER > 1.0 on most tiers due to HTML-table output format
- **No FCA metric**: Reading-order-sensitive CER may inflate errors when OCR engines produce different text-block segmentation

    # Unmatched hypothesis lines contribute CER = 1.0 each,
    # weighted by the ratio of unmatched hyp chars to total ref chars
    unmatched_hyp_count = len(hyp_lines) - len(used_hyp)
    for _ in range(unmatched_hyp_count):
        matched_cers.append(1.0)
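One possible reading of the weighting described in the comment above, as a hedged sketch rather than the implementation in cer_wer.py: each matched line contributes its CER weighted by its reference length, and each unmatched hypothesis line contributes CER = 1.0 weighted by its own character count, all normalized by the total reference characters. The function name and argument layout are hypothetical.

```python
def weighted_line_cer(
    matched: list[tuple[float, int]],
    unmatched_hyp_lines: list[str],
    total_ref_chars: int,
) -> float:
    """Character-weighted aggregate over matched and unmatched lines.

    matched: (cer, ref_line_length) pairs for lines that found a match.
    unmatched_hyp_lines: hypothesis lines with no reference counterpart.
    """
    if total_ref_chars <= 0:
        raise ValueError("total_ref_chars must be positive")
    # Matched lines: CER scaled back to an approximate error count.
    errors = sum(cer * ref_len for cer, ref_len in matched)
    # Unmatched hypothesis lines: every character counts as an error
    # (CER = 1.0 weighted by the line's character count).
    errors += sum(len(line) for line in unmatched_hyp_lines)
    return errors / total_ref_chars
```

For example, two matched 10-character lines with CER 0.0 and 0.5 plus one unmatched 5-character hypothesis line give (0 + 5 + 5) / 20 = 0.5, whereas appending an unweighted 1.0 per unmatched line and averaging would give (0.0 + 0.5 + 1.0) / 3 = 0.5 only by coincidence of these numbers.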




Summary
Adds research work that was previously uncommitted in the working tree, now separated onto its own branch.
Changes
train_tier1_lora.sh training script; tier1 expansion report.
Notes
Large generated outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl) are gitignored (research/**/outputs/*.jsonl), kept on disk, out of history. Small .json reports remain tracked.
Generated with Claude Code