
feat(research): OCR-IQA correlation analysis, tier1 expansion, and SmartDoc-QA study#36

Open
williaby wants to merge 2 commits into
mainfrom
research/ocr-iqa-correlation-expansion

Conversation

@williaby

Contributor

Summary

Adds research work that was previously uncommitted in the working tree, now separated onto its own branch.

Changes

  • OCR-IQA correlation: FCA and ensemble-spread analysis scripts + reports; expanded CER/WER analysis and tests; updated correlation doc and paper.
  • Tier1 expansion: stream3 VLM consensus + tier1 build improvements; train_tier1_lora.sh training script; tier1 expansion report.
  • SmartDoc-QA: OCR analysis pipeline (scripts, data metas) and dataset doc.

Notes

  • Large regenerable outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl) are gitignored via research/**/outputs/*.jsonl: they stay on disk but out of history. Small .json reports remain tracked.

Generated with Claude Code

williaby and others added 2 commits May 28, 2026 15:36
…d SmartDoc-QA study

- FCA and ensemble-spread analysis scripts + reports (ocr_iqa_correlation)
- expand CER/WER analysis and tests; update OCR-IQA correlation doc and paper
- stream3 VLM consensus + tier1 build improvements; tier1 LoRA training script
- SmartDoc-QA OCR analysis pipeline (scripts, data metas) and dataset doc
- tier1 expansion report

Large generated outputs (fca_per_image.jsonl, smartdoc_qa_error_rates.jsonl)
left untracked; they are regenerable from the committed scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 01:27
@coderabbitai

coderabbitai Bot commented May 29, 2026


Warning

Review limit reached

@williaby, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 50 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7e13f6b-7f02-4684-a4b0-ea4569f2e88d

📥 Commits

Reviewing files that changed from the base of the PR and between 1cfd40f and 2b4741a.

📒 Files selected for processing (24)
  • .gitignore
  • DeQA-Score/scripts/train_tier1_lora.sh
  • DeQA-Score/src/expansion/build_tier1.py
  • DeQA-Score/src/expansion/stream3_vlm_consensus.py
  • docs/datasets/ocr-iqa-correlation.md
  • docs/datasets/smartdoc-qa.md
  • docs/tier1-expansion-report.md
  • research/ocr_iqa_correlation/analysis/cer_wer.py
  • research/ocr_iqa_correlation/outputs/ensemble_spread_report.json
  • research/ocr_iqa_correlation/outputs/fca_analysis_report.json
  • research/ocr_iqa_correlation/scripts/07_fca_analysis.py
  • research/ocr_iqa_correlation/scripts/08_ensemble_spread_analysis.py
  • research/ocr_iqa_correlation/tests/test_cer_wer.py
  • research/papers/06_ocr_iqa_correlation/paper.md
  • research/smartdoc_qa_ocr_analysis/01_build_meta_json.py
  • research/smartdoc_qa_ocr_analysis/02_run_deqa.sh
  • research/smartdoc_qa_ocr_analysis/03_compute_correlation.py
  • research/smartdoc_qa_ocr_analysis/compute_error_rates.py
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_00.json
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_01.json
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_02.json
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_03.json
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_04.json
  • research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_all.json

@github-actions


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@sonarqubecloud


Quality Gate failed

Failed conditions
7 Security Hotspots
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


Copilot AI left a comment


Pull request overview

This PR adds research pipelines, reports, and documentation for OCR-IQA correlation expansion, Tier 1 VLM consensus labeling, and SmartDoc-QA OCR analysis.

Changes:

  • Adds FCA and ensemble-spread OCR-IQA analysis scripts, reports, tests, and paper updates.
  • Expands Tier 1 Stream 3 VLM consensus labeling and adds Tier 1 training/report documentation.
  • Adds SmartDoc-QA metadata generation, DeQA inference, OCR error-rate parsing, and correlation analysis.

Reviewed changes

Copilot reviewed 18 out of 24 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| .gitignore | Ignores regenerable research JSONL outputs. |
| DeQA-Score/scripts/train_tier1_lora.sh | Adds Tier 1 LoRA training script. |
| DeQA-Score/src/expansion/build_tier1.py | Integrates default Stream 3 source generation. |
| DeQA-Score/src/expansion/stream3_vlm_consensus.py | Reworks VLM consensus labeling with defaults, parallelism, and checkpointing. |
| docs/datasets/ocr-iqa-correlation.md | Updates OCR-IQA dataset status, engines, metrics, and results. |
| docs/datasets/smartdoc-qa.md | Adds SmartDoc-QA dataset documentation. |
| docs/tier1-expansion-report.md | Adds Tier 1 expansion report and training configuration. |
| research/ocr_iqa_correlation/analysis/cer_wer.py | Adds FCA computation and optional metric inclusion. |
| research/ocr_iqa_correlation/outputs/ensemble_spread_report.json | Adds generated ensemble/spread report. |
| research/ocr_iqa_correlation/outputs/fca_analysis_report.json | Adds generated FCA analysis report. |
| research/ocr_iqa_correlation/scripts/07_fca_analysis.py | Adds FCA recomputation and correlation script. |
| research/ocr_iqa_correlation/scripts/08_ensemble_spread_analysis.py | Adds multi-engine ensemble and spread analysis script. |
| research/ocr_iqa_correlation/tests/test_cer_wer.py | Adds FCA unit tests and compute_metrics coverage. |
| research/papers/06_ocr_iqa_correlation/paper.md | Expands paper with ensemble, VLM, and FCA findings. |
| research/smartdoc_qa_ocr_analysis/01_build_meta_json.py | Adds SmartDoc-QA DeQA manifest generation. |
| research/smartdoc_qa_ocr_analysis/02_run_deqa.sh | Adds SmartDoc-QA DeQA inference runner. |
| research/smartdoc_qa_ocr_analysis/03_compute_correlation.py | Adds SmartDoc-QA MOS/OCR correlation analysis. |
| research/smartdoc_qa_ocr_analysis/compute_error_rates.py | Adds SmartDoc-QA UNLV-ISRI accuracy parsing. |
| research/smartdoc_qa_ocr_analysis/data/smartdoc_qa_meta_04.json | Adds generated SmartDoc-QA metadata chunk. |


total_items=total,
errors=errors,
accuracy_pct=acc_pct,
error_rate=1.0 - acc_pct / 100.0,
total_items=total,
errors=errors,
accuracy_pct=acc_pct,
error_rate=1.0 - acc_pct / 100.0,
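The `error_rate` field in the snippet above is derived directly from the accuracy percentage. A minimal sketch of that conversion follows, assuming accuracy is computed in the UNLV-ISRI style as `(total - errors) / total * 100`; the helper names `accuracy_pct` and `error_rate_from_accuracy` are hypothetical and not taken from the PR. Under that assumption accuracy can be negative, so the resulting error rate can exceed 1.0.

```python
def accuracy_pct(total_items: int, errors: int) -> float:
    """Accuracy as a percentage (assumed UNLV-ISRI style).

    Goes negative when errors exceed total_items.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * (total_items - errors) / total_items


def error_rate_from_accuracy(acc_pct: float) -> float:
    """Same conversion as the quoted snippet: 1.0 - acc_pct / 100.0."""
    return 1.0 - acc_pct / 100.0


# 50 errors in 200 items: 75% accuracy, error rate 0.25.
ordinary = error_rate_from_accuracy(accuracy_pct(200, 50))

# 150 errors in 100 items: accuracy is -50%, error rate is 1.5.
over_unity = error_rate_from_accuracy(accuracy_pct(100, 150))
```

Downstream correlation code may want to decide explicitly whether error rates above 1.0 are kept or clipped.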
Comment on lines +680 to +692
result = label_single_image(image_path, api_key)
_append_checkpoint(checkpoint_path, result, source.name)

# Copy image to output if successful
if result.consensus_mos is not None and not result.excluded:
    import shutil

    images_out = output_dir / "images" / f"stream3_{source.name}"
    images_out.mkdir(parents=True, exist_ok=True)
    out_filename = f"{source.name}_{result.image_id}{image_path.suffix}"
    out_path = images_out / out_filename
    if not out_path.exists():
        shutil.copyfile(image_path, out_path)
Comment on lines +855 to +860
completed_ids = _load_checkpoint(checkpoint_path) if resume else set()

# Build work queue (skip already checkpointed)
work_items: list[tuple[SourceDataset, Path]] = []
for source, images in all_work:
    pending = [p for p in images if p.stem not in completed_ids]
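The resume logic above filters on `p.stem` alone, while the checkpoint write also records `source.name`. If two sources ever contained images with the same stem, a stem-only filter would skip the second one. The following is a minimal sketch of a checkpoint keyed by `(source, image_id)`, assuming an append-only JSONL file; the function names and record fields are hypothetical, and the PR's actual `_append_checkpoint` and `_load_checkpoint` may differ.

```python
import json
from pathlib import Path


def append_checkpoint(path: Path, source_name: str, image_id: str) -> None:
    """Append one completed item as a JSON line (append-only)."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"source": source_name, "image_id": image_id}) + "\n")


def load_checkpoint(path: Path) -> set[tuple[str, str]]:
    """Return the (source, image_id) pairs already recorded; empty if no file."""
    done: set[tuple[str, str]] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partially written line, e.g. after a crash
            done.add((rec["source"], rec["image_id"]))
    return done


def pending_items(source_name: str, images: list[Path],
                  done: set[tuple[str, str]]) -> list[Path]:
    """Keep only images whose (source, stem) pair is not checkpointed."""
    return [p for p in images if (source_name, p.stem) not in done]
```

With this keying, an image `0001.png` completed for one source does not cause `0001.png` from a different source to be skipped.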
Comment on lines +8 to +15
OHR-Bench 700 (train only — val/test reserved)
Tobacco800 500 (all — no reserved splits)
SmartDoc-QA 500 (train only — val/test reserved; filename-based split)
RealDAE 400 (all — no reserved splits; _in images only)
FUNSD+ 300 (train only — test reserved; filename-based split)
OCR-Quality 200 (all — no reserved splits)
SROIE 100 (train only — test reserved)
Total: 2,700 samples
)
for s in config.stream3_sources
]
from .stream3_vlm_consensus import TIER1_SOURCES, SourceDataset, generate_stream3
| Spread (std CER) vs. MOS | +0.278 | < 10^-22 |
| Spread (std CER) vs. mean CER | -0.474 | < 10^-67 |

Taken at face value, the positive SRCC (+0.278) between spread and MOS means that spread rises with MOS: engines disagree more, not less, on CER for higher-quality images. The negative correlation between spread and mean CER (-0.474) points the same way, with spread shrinking as overall error grows. The spread-MOS correlation is weak, suggesting limited utility as a standalone quality signal, and the stronger spread-CER correlation indicates that spread partially tracks overall error level rather than providing orthogonal information.
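To make the sign convention concrete, here is a small standard-library sketch of how spread and its rank correlations could be computed. It assumes spread is the population standard deviation of per-engine CER for each image (the report only says "std CER", so sample standard deviation is equally possible). The data are made-up toy values and are not meant to reproduce the reported coefficients.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: one row per image, one CER per OCR engine (invented values).
per_engine_cer = [
    [0.10, 0.12, 0.11],
    [0.30, 0.50, 0.20],
    [0.05, 0.05, 0.06],
    [0.40, 0.80, 0.35],
]
mos = [3.3, 3.0, 3.4, 2.9]

spread = [pstdev(row) for row in per_engine_cer]
mean_cer = [mean(row) for row in per_engine_cer]

srcc_spread_mos = spearman(spread, mos)
srcc_spread_cer = spearman(spread, mean_cer)
```

A positive `srcc_spread_mos` means spread grows with MOS; a negative value means engines agree more on higher-MOS images.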
- **MOS scale compression**: DeQA MOS ranges only 2.94-3.35 across tiers (narrow dynamic range)
- **High baseline CER**: All engines show CER > 0.28 even on original images, likely due to form-specific OCR challenges (checkboxes, tables, handwriting)
- **DeepSeek-OCR2 hallucination**: CER > 1.0 on most tiers due to HTML-table output format
- **No FCA metric**: Reading-order-sensitive CER may inflate errors when OCR engines produce different text-block segmentation
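CER above 1.0, as noted for the HTML-table output, is possible because CER divides edit distance by the reference length only, so a much longer hypothesis can cost more edits than the reference has characters. A minimal sketch with a plain Levenshtein distance follows; the example strings are invented for illustration and the PR's own `cer_wer.py` may normalize text differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Invented example: the hypothesis wraps the same content in HTML table markup.
reference = "Total 42"
hypothesis = "<table><tr><td>Total</td><td>42</td></tr></table>"
markup_cer = cer(reference, hypothesis)
```

Here the recognized words are correct, yet `markup_cer` is well above 1.0 because every markup character counts as an insertion.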
Comment on lines +235 to +239
# Unmatched hypothesis lines contribute CER = 1.0 each,
# weighted by the ratio of unmatched hyp chars to total ref chars
unmatched_hyp_count = len(hyp_lines) - len(used_hyp)
for _ in range(unmatched_hyp_count):
    matched_cers.append(1.0)
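The comment in this hunk describes a weighting by unmatched hypothesis characters relative to total reference characters, whereas the loop appends an unweighted 1.0 per unmatched line. One way the described weighting could look is sketched below. It assumes `used_hyp` is a set of indices into `hyp_lines`, which the `len(hyp_lines) - len(used_hyp)` expression suggests but does not confirm; the function name `unmatched_hyp_penalty` is hypothetical.

```python
def unmatched_hyp_penalty(hyp_lines: list[str], used_hyp: set[int],
                          total_ref_chars: int) -> float:
    """Penalty for hypothesis lines that matched no reference line.

    Each unmatched line counts as CER = 1.0, scaled by its share of
    characters relative to the total reference length.
    """
    if total_ref_chars <= 0:
        raise ValueError("total_ref_chars must be positive")
    unmatched_chars = sum(
        len(line) for idx, line in enumerate(hyp_lines) if idx not in used_hyp
    )
    return unmatched_chars / total_ref_chars


# Lines 1 and 2 are unmatched: (2 + 5) characters against 10 reference chars.
penalty = unmatched_hyp_penalty(["abcd", "xy", "hello"], {0}, 10)
```

Under this scheme a stray two-character line costs far less than a hallucinated paragraph, which an unweighted 1.0 per line does not distinguish.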
2 participants