Fix missed signature field promotion for signature-labeled forms by varun3011 · Pull Request #35 · jbarrow/commonforms

varun3011 · 2026-05-20T21:08:38Z

Fixes cases where --use-signature-fields does not produce /Sig fields even
when the PDF contains a clear signature label and line.

What changed

- added layout-aware post-processing in prepare_form:
  - extract positioned PDF text with pypdf
  - find text fragments containing "signature"
  - group detected text boxes into aligned rows
  - promote the best-matching row's leftmost field to Signature when the
    detector missed the signature class entirely
- added regression tests for:
  - promotion on the bundled test PDF
  - no promotion on a page without a signature label
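
The row-grouping and promotion steps listed above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Box`, `Widget`, `group_rows`, `promote_signature`, the class names "TextBox" and "Signature", the normalized coordinates, and the choice of "nearest vertical center" as the best-matching row are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def cy(self) -> float:
        # Vertical center, used to decide which boxes share a row.
        return (self.y0 + self.y1) / 2


@dataclass(frozen=True)
class Widget:
    kind: str  # assumed class names: "TextBox" or "Signature"
    box: Box


def group_rows(widgets: list[Widget], tolerance: float = 0.01) -> list[list[Widget]]:
    """Group widgets whose vertical centers are within `tolerance` of their neighbor."""
    rows: list[list[Widget]] = []
    for widget in sorted(widgets, key=lambda w: w.box.cy):
        if rows and abs(rows[-1][-1].box.cy - widget.box.cy) <= tolerance:
            rows[-1].append(widget)
        else:
            rows.append([widget])
    return rows


def row_center(row: list[Widget]) -> float:
    return sum(w.box.cy for w in row) / len(row)


def promote_signature(
    widgets: list[Widget], label: Box, tolerance: float = 0.01
) -> list[Widget]:
    """Promote the leftmost text box in the row closest to `label`.

    Does nothing if a Signature widget already exists (the detector did not
    miss the class) or if there are no text boxes to promote.
    """
    if any(w.kind == "Signature" for w in widgets):
        return widgets
    text_boxes = [w for w in widgets if w.kind == "TextBox"]
    if not text_boxes:
        return widgets
    rows = group_rows(text_boxes, tolerance)
    # Assumption: "best-matching" means the row vertically nearest to the label.
    best_row = min(rows, key=lambda row: abs(row_center(row) - label.cy))
    target = min(best_row, key=lambda w: w.box.x0)
    return [replace(w, kind="Signature") if w is target else w for w in widgets]
```

Widgets outside the chosen row are returned unchanged, so the fallback alters at most one field per call in this sketch.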

Why

The detector can classify signature areas as regular text boxes. This change
adds a fallback that uses document text/layout to recover likely signature
fields without requiring model retraining.

Notes

This is a post-processing fix, not a model-weight change. It depends on PDFs
having an extractable text layer near the signature label, so scanned/image-
only PDFs are still a limitation.

jbarrow

Thank you! Overall a welcome and great addition. Requested a few changes to try and minimize the impact (limit # of reads, only run when it will change something).

jbarrow · 2026-05-20T22:49:30Z

        doc.document.close()


+@dataclass


Could you make this a pydantic BaseModel in utils.py?

It might also be something that gets attached to the Page object, alongside the PIL image.

varun3011 · 2026-05-20T23:20:04Z

Done. Moved TextFragment into utils.py as a BaseModel
and attached text_fragments to Page.

jbarrow · 2026-05-20T22:52:52Z

+
+
+def extract_text_fragments(input_path: str | Path) -> dict[int, list[TextFragment]]:
+    reader = pypdf.PdfReader(str(input_path))


We're already reading the PDF with pypdf in the PyPDFFormCreator, I wonder if there's a way to reuse that and avoid doing the reread here.

Alternatively, could use pypdfium2 to get the text runs (I think this will be more robust than pypdf, personally), and reuse the reading in the render function.

varun3011 · 2026-05-20T23:20:15Z

Done. Switched this to reuse pypdfium2 during render_pdf()
instead of doing a separate pypdf read.

jbarrow · 2026-05-20T22:54:39Z

+        signature_labels = [
+            fragment
+            for fragment in text_fragments.get(page_ix, [])
+            if "signature" in fragment.text.lower()


It would be nice to not hardcode this, mostly to support/consider other languages (e.g. "Unterschrift" in German).

varun3011 · 2026-05-20T23:20:26Z

Done. This now uses configurable
signature_label_terms instead of hardcoding "signature".
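
For illustration, a configurable term match could look something like the sketch below. Only the name signature_label_terms comes from this thread; the helper `is_signature_label`, the default tuple, and the case-insensitive substring matching are assumptions, not the merged implementation.

```python
from collections.abc import Iterable

# Hypothetical default; the PR only states that the terms are configurable.
DEFAULT_SIGNATURE_LABEL_TERMS = ("signature",)


def is_signature_label(
    text: str, signature_label_terms: Iterable[str] = DEFAULT_SIGNATURE_LABEL_TERMS
) -> bool:
    """Return True if any configured term occurs in `text`, ignoring case."""
    haystack = text.casefold()
    # Skip empty terms: "" is a substring of everything and would match any text.
    return any(term.casefold() in haystack for term in signature_label_terms if term)
```

With this shape, supporting German forms would be a matter of passing `signature_label_terms=("signature", "unterschrift")`.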

jbarrow · 2026-05-20T22:56:01Z

        results = detector.extract_widgets(
            pages, confidence=confidence, image_size=image_size
        )
+    results = promote_signature_widgets(input_path, results)


This should only be run if use_signature_fields is True.

varun3011 · 2026-05-20T23:20:38Z

The promotion fallback now only runs when
use_signature_fields=True.
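
A minimal sketch of that gating, also skipping pages where promotion would not change anything. The function `detect_with_fallback`, the per-page dictionary of class names, and the per-page (rather than per-document) check are assumptions for the example; only the use_signature_fields flag is taken from the thread.

```python
from collections.abc import Callable


def detect_with_fallback(
    detected: dict[int, list[str]],
    promote: Callable[[list[str]], list[str]],
    use_signature_fields: bool = False,
) -> dict[int, list[str]]:
    """Apply `promote` per page only when signature fields were requested
    and the page does not already contain a "Signature" widget."""
    if not use_signature_fields:
        # Flag off: return the detector output untouched.
        return detected
    return {
        page_ix: kinds if "Signature" in kinds else promote(kinds)
        for page_ix, kinds in detected.items()
    }
```

Keeping the check outside the promotion function means the fallback costs nothing on the default code path.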

varun3011 · 2026-05-20T23:17:00Z

Pushed an update that addresses the review comments and the CI device
regression.

Changes:

- moved TextFragment into utils.py as a BaseModel
- attached text fragments to Page
- switched text extraction to reuse pypdfium2 during render_pdf()
- only run signature promotion when use_signature_fields=True
- made signature label terms configurable via signature_label_terms
- fixed FFDetr device handling so it stays on the requested device in
  CI/local runs

I reran uv run -m pytest tests/inference_test.py locally and all 7 tests
pass.

jbarrow · 2026-05-21T00:05:54Z

Tests passing locally, looks good with both FFDetr and FFDNet. Merging.

Interestingly, I see this case (attached PDF and image) where it correctly fixes one signature field but not the other.

[fillable.pdf](https://github.com/user-attachments/files/28079290/fillable.pdf)

Fix missed signature field promotion

7c6d01c

jbarrow reviewed May 20, 2026

Address review feedback and fix FFDetr device handling

956d9d4

varun3011 commented May 20, 2026

jbarrow merged commit 220b267 into jbarrow:main May 21, 2026
3 of 4 checks passed


