feat(research): OCR-IQA correlation analysis, tier1 expansion, and SmartDoc-QA study by williaby · Pull Request #36 · ByronWilliamsCPA/DeQA-Doc

williaby · 2026-05-29T01:27:04Z

Summary

Adds research work that was previously uncommitted in the working tree, now separated onto its own branch.

Changes

OCR-IQA correlation: FCA and ensemble-spread analysis scripts + reports; expanded CER/WER analysis and tests; updated correlation doc and paper.
Tier1 expansion: stream3 VLM consensus + tier1 build improvements; train_tier1_lora.sh training script; tier1 expansion report.
SmartDoc-QA: OCR analysis pipeline (scripts, data metas) and dataset doc.

Notes

Large regenerable outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl) are gitignored via the pattern research/**/outputs/*.jsonl: they stay on disk but are kept out of Git history. Small .json reports remain tracked.

…d SmartDoc-QA study

- FCA and ensemble-spread analysis scripts + reports (ocr_iqa_correlation)
- expand CER/WER analysis and tests; update OCR-IQA correlation doc and paper
- stream3 VLM consensus + tier1 build improvements; tier1 LoRA training script
- SmartDoc-QA OCR analysis pipeline (scripts, data metas) and dataset doc
- tier1 expansion report

Large generated outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl) left untracked; they are regenerable from the committed scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-29T01:27:11Z

Warning

Review limit reached

@williaby, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 50 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7e13f6b-7f02-4684-a4b0-ea4569f2e88d

📥 Commits

Reviewing files that changed from the base of the PR and between 1cfd40f and 2b4741a.

📒 Files selected for processing (24)

.gitignore
DeQA-Score/scripts/train_tier1_lora.sh
DeQA-Score/src/expansion/build_tier1.py
DeQA-Score/src/expansion/stream3_vlm_consensus.py
docs/datasets/ocr-iqa-correlation.md
docs/datasets/smartdoc-qa.md
docs/tier1-expansion-report.md
research/ocr_iqa_correlation/analysis/cer_wer.py
research/ocr_iqa_correlation/outputs/ensemble_spread_report.json
research/ocr_iqa_correlation/outputs/fca_analysis_report.json
research/ocr_iqa_correlation/scripts/07_fca_analysis.py
research/ocr_iqa_correlation/scripts/08_ensemble_spread_analysis.py
research/ocr_iqa_correlation/tests/test_cer_wer.py
research/papers/06_ocr_iqa_correlation/paper.md
research/smartdoc_qa_ocr_analysis/01_build_meta_json.py
research/smartdoc_qa_ocr_analysis/02_run_deqa.sh
research/smartdoc_qa_ocr_analysis/03_compute_correlation.py
research/smartdoc_qa_ocr_analysis/compute_error_rates.py
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_00.json
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_01.json
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_02.json
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_03.json
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_04.json
research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_all.json


github-actions · 2026-05-29T01:27:20Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

sonarqubecloud · 2026-05-29T01:28:16Z

Quality Gate failed

Failed conditions
7 Security Hotspots
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


Copilot

Pull request overview

This PR adds research pipelines, reports, and documentation for OCR-IQA correlation expansion, Tier 1 VLM consensus labeling, and SmartDoc-QA OCR analysis.

Changes:

Adds FCA and ensemble-spread OCR-IQA analysis scripts, reports, tests, and paper updates.
Expands Tier 1 Stream 3 VLM consensus labeling and adds Tier 1 training/report documentation.
Adds SmartDoc-QA metadata generation, DeQA inference, OCR error-rate parsing, and correlation analysis.

Reviewed changes

Copilot reviewed 18 out of 24 changed files in this pull request and generated 9 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `.gitignore` | Ignores regenerable research JSONL outputs. |
| `DeQA-Score/scripts/train_tier1_lora.sh` | Adds Tier 1 LoRA training script. |
| `DeQA-Score/src/expansion/build_tier1.py` | Integrates default Stream 3 source generation. |
| `DeQA-Score/src/expansion/stream3_vlm_consensus.py` | Reworks VLM consensus labeling with defaults, parallelism, and checkpointing. |
| `docs/datasets/ocr-iqa-correlation.md` | Updates OCR-IQA dataset status, engines, metrics, and results. |
| `docs/datasets/smartdoc-qa.md` | Adds SmartDoc-QA dataset documentation. |
| `docs/tier1-expansion-report.md` | Adds Tier 1 expansion report and training configuration. |
| `research/ocr_iqa_correlation/analysis/cer_wer.py` | Adds FCA computation and optional metric inclusion. |
| `research/ocr_iqa_correlation/outputs/ensemble_spread_report.json` | Adds generated ensemble/spread report. |
| `research/ocr_iqa_correlation/outputs/fca_analysis_report.json` | Adds generated FCA analysis report. |
| `research/ocr_iqa_correlation/scripts/07_fca_analysis.py` | Adds FCA recomputation and correlation script. |
| `research/ocr_iqa_correlation/scripts/08_ensemble_spread_analysis.py` | Adds multi-engine ensemble and spread analysis script. |
| `research/ocr_iqa_correlation/tests/test_cer_wer.py` | Adds FCA unit tests and compute_metrics coverage. |
| `research/papers/06_ocr_iqa_correlation/paper.md` | Expands paper with ensemble, VLM, and FCA findings. |
| `research/smartdoc_qa_ocr_analysis/01_build_meta_json.py` | Adds SmartDoc-QA DeQA manifest generation. |
| `research/smartdoc_qa_ocr_analysis/02_run_deqa.sh` | Adds SmartDoc-QA DeQA inference runner. |
| `research/smartdoc_qa_ocr_analysis/03_compute_correlation.py` | Adds SmartDoc-QA MOS/OCR correlation analysis. |
| `research/smartdoc_qa_ocr_analysis/compute_error_rates.py` | Adds SmartDoc-QA UNLV-ISRI accuracy parsing. |
| `research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_04.json` | Adds generated SmartDoc-QA metadata chunk. |


+                        total_items=total,
+                        errors=errors,
+                        accuracy_pct=acc_pct,
+                        error_rate=1.0 - acc_pct / 100.0,


+                        total_items=total,
+                        errors=errors,
+                        accuracy_pct=acc_pct,
+                        error_rate=1.0 - acc_pct / 100.0,
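The `error_rate` expression in these hunks turns an accuracy percentage into a fraction. A minimal sketch of that arithmetic follows. The helper names `accuracy_pct` and `error_rate_from_accuracy` are hypothetical and not taken from the PR, and the inputs are made up. It assumes accuracy is computed as (total - errors) / total, so the percentage goes negative, and the error rate exceeds 1.0, whenever a report counts more errors than reference items.

```python
def accuracy_pct(total_items: int, errors: int) -> float:
    """Accuracy as a percentage; negative when errors exceed total_items."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * (total_items - errors) / total_items


def error_rate_from_accuracy(acc_pct: float) -> float:
    """Convert an accuracy percentage (e.g. 97.5) to an error rate (0.025)."""
    return 1.0 - acc_pct / 100.0


if __name__ == "__main__":
    # Made-up counts: one ordinary case, one where errors exceed the item count.
    for total, errors in [(200, 5), (200, 250)]:
        acc = accuracy_pct(total, errors)
        print(total, errors, round(acc, 2), round(error_rate_from_accuracy(acc), 4))
```

Downstream correlation code that expects error rates in [0, 1] may want to decide explicitly whether to clip such values or keep them.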


+    result = label_single_image(image_path, api_key)
+    _append_checkpoint(checkpoint_path, result, source.name)
+
+    # Copy image to output if successful
+    if result.consensus_mos is not None and not result.excluded:
+        import shutil
+
+        images_out = output_dir / "images" / f"stream3_{source.name}"
+        images_out.mkdir(parents=True, exist_ok=True)
+        out_filename = f"{source.name}_{result.image_id}{image_path.suffix}"
+        out_path = images_out / out_filename
+        if not out_path.exists():
+            shutil.copyfile(image_path, out_path)


+    completed_ids = _load_checkpoint(checkpoint_path) if resume else set()
+
+    # Build work queue (skip already checkpointed)
+    work_items: list[tuple[SourceDataset, Path]] = []
+    for source, images in all_work:
+        pending = [p for p in images if p.stem not in completed_ids]
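The resume logic in these hunks (append a result to a checkpoint, then skip images whose stem is already recorded) can be sketched as below. This is only an illustration under stated assumptions: the bodies of `_append_checkpoint` and `_load_checkpoint` are not visible in the hunks, so the JSON Lines layout, the field names, and the function names here are all hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def append_checkpoint(path: Path, image_id: str, source_name: str, mos: float | None) -> None:
    """Append one result as a JSON line so an interrupted run loses little work."""
    record = {"image_id": image_id, "source": source_name, "consensus_mos": mos}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_completed_ids(path: Path) -> set[str]:
    """Return image ids already recorded; an empty set if no checkpoint exists."""
    completed: set[str] = set()
    if not path.exists():
        return completed
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                completed.add(json.loads(line)["image_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip a truncated or malformed line left by an interrupted write.
                continue
    return completed


def pending_images(images: list[Path], completed: set[str]) -> list[Path]:
    """Keep only images whose stem has not been checkpointed."""
    return [p for p in images if p.stem not in completed]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoint.jsonl"
        append_checkpoint(ckpt, "doc_001", "example_source", 3.2)
        images = [Path("doc_001.png"), Path("doc_002.png")]
        print(pending_images(images, load_completed_ids(ckpt)))
```

Keying on the stem alone means two sources that share a file stem would be treated as the same image. Whether that applies to the PR's code is not visible in the hunk; if it does, a composite key such as source name plus stem would avoid it.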


+    OHR-Bench     700  (train only — val/test reserved)
+    Tobacco800    500  (all — no reserved splits)
+    SmartDoc-QA   500  (train only — val/test reserved; filename-based split)
+    RealDAE       400  (all — no reserved splits; _in images only)
+    FUNSD+        300  (train only — test reserved; filename-based split)
+    OCR-Quality   200  (all — no reserved splits)
+    SROIE         100  (train only — test reserved)
+    Total:      2,700 samples
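The quota table in this docstring can be kept honest with a small consistency check. The sketch below copies the per-source counts from the hunk into a plain dictionary; `STREAM3_QUOTAS` and `check_quotas` are hypothetical names, and the PR's actual configuration (`TIER1_SOURCES`, `SourceDataset`) is not reproduced here.

```python
# Per-source sample counts copied from the docstring table in the hunk above.
# The dictionary itself is illustrative, not the PR's TIER1_SOURCES structure.
STREAM3_QUOTAS = {
    "OHR-Bench": 700,
    "Tobacco800": 500,
    "SmartDoc-QA": 500,
    "RealDAE": 400,
    "FUNSD+": 300,
    "OCR-Quality": 200,
    "SROIE": 100,
}
EXPECTED_TOTAL = 2_700


def check_quotas(quotas: dict[str, int], expected_total: int) -> int:
    """Return the summed quota, raising if it drifts from the documented total."""
    total = sum(quotas.values())
    if total != expected_total:
        raise ValueError(f"quotas sum to {total}, expected {expected_total}")
    return total


if __name__ == "__main__":
    print(check_quotas(STREAM3_QUOTAS, EXPECTED_TOTAL))
```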


-            )
-            for s in config.stream3_sources
-        ]
+        from .stream3_vlm_consensus import TIER1_SOURCES, SourceDataset, generate_stream3


+| Spread (std CER) vs. MOS | +0.278 | < 10^-22 |
+| Spread (std CER) vs. mean CER | -0.474 | < 10^-67 |
+
+The positive SRCC (+0.278) between spread and MOS indicates that spread tends to rise with image quality: per-engine CER values diverge more on higher-MOS images and cluster more tightly on lower-MOS ones. This is consistent with the negative correlation between spread and mean CER (-0.474), where engines agree more closely as overall error rises. The spread-MOS correlation is weak, however, suggesting limited utility as a standalone quality signal, and the stronger spread-CER correlation indicates that spread partially tracks overall error level rather than providing orthogonal information.
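For readers checking the sign convention, the spread statistic and its rank correlation with MOS can be sketched in pure Python. This is a minimal illustration, not the PR's `08_ensemble_spread_analysis.py`: it assumes spread is the population standard deviation of per-engine CER for one image, the function names are hypothetical, and the numbers are made up. In practice `scipy.stats.spearmanr` is the usual choice and also reports a p-value.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cer_spread(per_engine_cer):
    """Spread for one image: population std of CER across engines (assumption)."""
    return pstdev(per_engine_cer)


if __name__ == "__main__":
    # Made-up data: one row of per-engine CER values per image, plus a MOS per image.
    cer_rows = [[0.30, 0.32, 0.31], [0.10, 0.25, 0.40], [0.05, 0.20, 0.50]]
    mos = [2.9, 3.1, 3.3]
    spreads = [cer_spread(row) for row in cer_rows]
    print(spearman(spreads, mos))
```

A positive coefficient here means spread tends to increase with MOS; a negative one means it tends to decrease.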


+- **MOS scale compression**: DeQA MOS ranges only 2.94-3.35 across tiers (narrow dynamic range)
+- **High baseline CER**: All engines show CER > 0.28 even on original images, likely due to form-specific OCR challenges (checkboxes, tables, handwriting)
+- **DeepSeek-OCR2 hallucination**: CER > 1.0 on most tiers due to HTML-table output format
+- **No FCA metric**: Reading-order-sensitive CER may inflate errors when OCR engines produce different text-block segmentation
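The CER > 1.0 figure in the list above is possible because CER, under its common definition of edit distance divided by reference length, has no upper bound: extra output such as markup counts as insertions. A minimal sketch follows, assuming that definition (the normalization used in the PR's `cer_wer.py` is not shown in this hunk) and using a made-up HTML-wrapped hypothesis.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via row-by-row dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference character
                            curr[j - 1] + 1,      # extra hypothesis character
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "abc"
    plain = "abd"
    wrapped = "<table><tr><td>abc</td></tr></table>"  # made-up HTML-style output
    print(cer(reference, plain), cer(reference, wrapped))
```

Stripping or normalizing markup before scoring is one way to separate recognition errors from output-format effects.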


+    # Unmatched hypothesis lines contribute CER = 1.0 each,
+    # weighted by the ratio of unmatched hyp chars to total ref chars
+    unmatched_hyp_count = len(hyp_lines) - len(used_hyp)
+    for _ in range(unmatched_hyp_count):
+        matched_cers.append(1.0)
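The penalty step in this hunk adds a CER of 1.0 to the list being averaged for every hypothesis line left unmatched. The sketch below isolates that behaviour and compares it with a character-weighted average. Both helpers are hypothetical, the weighted variant is only one possible reading of "weighted" and is not claimed to be the PR's formula, and returning 0.0 for empty input is an arbitrary choice made for the sketch.

```python
from statistics import mean


def mean_cer_flat_penalty(matched_cers: list[float], unmatched_hyp_lines: list[str]) -> float:
    """Average per-line CER, adding a flat 1.0 for each unmatched hypothesis line."""
    cers = list(matched_cers) + [1.0] * len(unmatched_hyp_lines)
    return mean(cers) if cers else 0.0


def mean_cer_char_weighted(matched: list[tuple[float, int]], unmatched_hyp_lines: list[str]) -> float:
    """Character-weighted average of per-line CER.

    matched holds (cer, ref_char_count) pairs; each unmatched hypothesis line
    counts as CER 1.0 weighted by its own character count.
    """
    unmatched_chars = sum(len(line) for line in unmatched_hyp_lines)
    weighted_sum = sum(c * n for c, n in matched) + 1.0 * unmatched_chars
    total_weight = sum(n for _, n in matched) + unmatched_chars
    return weighted_sum / total_weight if total_weight else 0.0


if __name__ == "__main__":
    # Made-up case: one perfectly matched 100-character line, one stray 1-character line.
    print(mean_cer_flat_penalty([0.0], ["x"]))
    print(mean_cer_char_weighted([(0.0, 100)], ["x"]))
```

With a flat penalty, a single stray one-character line moves the average as much as a long unmatched paragraph would; the weighted variant scales the penalty with the amount of unmatched text.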


williaby and others added 2 commits May 28, 2026 15:36

chore: gitignore large regenerable analysis outputs (*.jsonl)

2b4741a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 29, 2026 01:27

Copilot started reviewing on behalf of williaby May 29, 2026 01:27

Copilot AI reviewed May 29, 2026


williaby wants to merge 2 commits into main from research/ocr-iqa-correlation-expansion
