Fix missed signature field promotion for signature-labeled forms #35

Merged
jbarrow merged 2 commits into jbarrow:main from varun3011:fix-signature-promotion
May 21, 2026

Conversation

@varun3011
Contributor

Fixes cases where --use-signature-fields does not produce /Sig fields even
when the PDF contains a clear signature label and line.

What changed

  • added layout-aware post-processing in prepare_form
  • extract positioned PDF text with pypdf
  • find text fragments containing "signature"
  • group detected text boxes into aligned rows
  • promote the best-matching row’s leftmost field to Signature when the
    detector missed the signature class entirely
  • added regression tests for:
    • promotion on the bundled test PDF
    • no promotion on a page without a signature label

Why

The detector can classify signature areas as regular text boxes. This change
adds a fallback that uses document text/layout to recover likely signature
fields without requiring model retraining.

Notes

This is a post-processing fix, not a model-weight change. It depends on PDFs
having an extractable text layer near the signature label, so scanned or
image-only PDFs are still a limitation.

Owner

@jbarrow jbarrow left a comment

Thank you! Overall a welcome and great addition. Requested a few changes to try and minimize the impact (limit # of reads, only run when it will change something).

Comment thread commonforms/inference.py Outdated
doc.document.close()


@dataclass
Owner

Could you make this a pydantic BaseModel in utils.py?

Owner

It might also be something that gets attached to the Page object, alongside the PIL image.

Contributor Author

Done. Moved TextFragment into utils.py as a BaseModel and attached text_fragments to Page.

Comment thread commonforms/inference.py Outdated


def extract_text_fragments(input_path: str | Path) -> dict[int, list[TextFragment]]:
    reader = pypdf.PdfReader(str(input_path))
Owner

We're already reading the PDF with pypdf in the PyPDFFormCreator, I wonder if there's a way to reuse that and avoid doing the reread here.

Alternatively, could use pypdfium2 to get the text runs (I think this will be more robust than pypdf, personally), and reuse the reading in the render function.

Contributor Author

Done. Switched this to reuse pypdfium2 during render_pdf() instead of doing a separate pypdf read.

Comment thread commonforms/inference.py Outdated
signature_labels = [
    fragment
    for fragment in text_fragments.get(page_ix, [])
    if "signature" in fragment.text.lower()
Owner

It would be nice to not hardcode this, mostly to support/consider other languages (e.g. "Unterschrift" in German).

Contributor Author

Done. This now uses configurable signature_label_terms instead of hardcoding "signature".
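
To illustrate what configurable label terms could look like, here is a small sketch. Only the name signature_label_terms comes from this thread; the default value, the tuple type, the helper name `is_signature_label`, and the use of case-folded substring matching are assumptions and may differ from the merged code.

```python
from typing import Iterable

# Assumed default; the real default in the PR may differ.
DEFAULT_SIGNATURE_LABEL_TERMS: tuple[str, ...] = ("signature",)


def is_signature_label(
    text: str,
    signature_label_terms: Iterable[str] = DEFAULT_SIGNATURE_LABEL_TERMS,
) -> bool:
    """Return True if any configured term occurs in the text, ignoring case."""
    normalized = text.casefold()
    return any(term.casefold() in normalized for term in signature_label_terms)


# English-only defaults miss a German label; adding a term picks it up.
english_hit = is_signature_label("Applicant Signature:")
german_miss = is_signature_label("Unterschrift des Antragstellers")
german_hit = is_signature_label(
    "Unterschrift des Antragstellers",
    signature_label_terms=("signature", "unterschrift"),
)
```

Using `casefold()` rather than `lower()` is a design choice in this sketch: it handles a few non-English case mappings more consistently, which matters once the term list is no longer English-only.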

Comment thread commonforms/inference.py Outdated
results = detector.extract_widgets(
    pages, confidence=confidence, image_size=image_size
)
results = promote_signature_widgets(input_path, results)
Owner

This should only be run if use_signature_fields is True.

Contributor Author

The promotion fallback now only runs when use_signature_fields=True.
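
The gating described here can be sketched as follows. The helper name `apply_signature_fallback` and the `promote` callable are hypothetical stand-ins, not the functions in commonforms/inference.py; only the use_signature_fields flag comes from the thread.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def apply_signature_fallback(
    results: T,
    use_signature_fields: bool,
    promote: Callable[[T], T],
) -> T:
    """Run the promotion step only when signature fields were requested."""
    if not use_signature_fields:
        # Skip the extra pass entirely: no text lookup, no relabeling.
        return results
    return promote(results)


calls: list[list[str]] = []


def fake_promote(results: list[str]) -> list[str]:
    """Stand-in promotion step that records each invocation."""
    calls.append(results)
    return [r.replace("TextBox", "Signature") for r in results]


skipped = apply_signature_fallback(["TextBox"], False, fake_promote)
promoted = apply_signature_fallback(["TextBox"], True, fake_promote)
```

With the flag off, the promotion callable is never invoked, so the default path pays no extra cost.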

Comment thread tests/inference_test.py
@varun3011
Contributor Author

Pushed an update that addresses the review comments and the CI device
regression.

Changes:

  • moved TextFragment into utils.py as a BaseModel
  • attached text fragments to Page
  • switched text extraction to reuse pypdfium2 during render_pdf()
  • only run signature promotion when use_signature_fields=True
  • made signature label terms configurable via signature_label_terms
  • fixed FFDetr device handling so it stays on the requested device in
    CI/local runs

I reran uv run -m pytest tests/inference_test.py locally and all 7 tests
pass.

@jbarrow
Owner

jbarrow commented May 21, 2026

Tests passing locally, looks good with both FFDetr and FFDNet. Merging.

Interestingly, I see this case (attached PDF and image) where it correctly fixes one signature field but not the other.

Screenshot 2026-05-20 at 8 04 51 PM [fillable.pdf](https://github.com/user-attachments/files/28079290/fillable.pdf)

@jbarrow jbarrow merged commit 220b267 into jbarrow:main May 21, 2026
3 of 4 checks passed