Fix missed signature field promotion for signature-labeled forms #35
jbarrow
left a comment
Thank you! Overall a welcome and great addition. Requested a few changes to try and minimize the impact (limit # of reads, only run when it will change something).
    doc.document.close()


    @dataclass
Could you make this a pydantic BaseModel in utils.py?
It might also be something that gets attached to the Page object, alongside the PIL image.
Done. Moved TextFragment into utils.py as a BaseModel
and attached text_fragments to Page.
    def extract_text_fragments(input_path: str | Path) -> dict[int, list[TextFragment]]:
        reader = pypdf.PdfReader(str(input_path))
We're already reading the PDF with pypdf in the PyPDFFormCreator; I wonder if there's a way to reuse that and avoid the reread here.
Alternatively, we could use pypdfium2 to get the text runs (I think this will be more robust than pypdf, personally) and reuse the read in the render function.
Done. Switched this to reuse pypdfium2 during render_pdf()
instead of doing a separate pypdf read.
    signature_labels = [
        fragment
        for fragment in text_fragments.get(page_ix, [])
        if "signature" in fragment.text.lower()
It would be nice to not hardcode this, mostly to support/consider other languages (e.g. "Unterschrift" in German).
Done. This now uses configurable
signature_label_terms instead of hardcoding "signature".
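For readers following along, a minimal sketch of what configurable label matching could look like. Only signature_label_terms comes from this thread; the TextFragment stand-in below is a plain dataclass with just a text field, and find_signature_labels and DEFAULT_TERMS are hypothetical names, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """Stand-in for the PR's TextFragment; only the text matters here."""
    text: str


# Hypothetical default; callers pass extra terms to cover other languages.
DEFAULT_TERMS = ("signature",)


def find_signature_labels(fragments, signature_label_terms=DEFAULT_TERMS):
    """Return fragments whose text contains any configured term, ignoring case."""
    terms = [term.casefold() for term in signature_label_terms if term]
    return [
        fragment
        for fragment in fragments
        if any(term in fragment.text.casefold() for term in terms)
    ]


fragments = [
    TextFragment("Unterschrift:"),
    TextFragment("Date"),
    TextFragment("Signature of applicant"),
]
english_only = find_signature_labels(fragments)
with_german = find_signature_labels(fragments, ("signature", "unterschrift"))
```

With the default terms only the English label matches; adding "unterschrift" picks up the German one as well, which is the case raised in the review.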
    results = detector.extract_widgets(
        pages, confidence=confidence, image_size=image_size
    )
    results = promote_signature_widgets(input_path, results)
This should only be run if use_signature_fields is True.
The promotion fallback now only runs when
use_signature_fields=True.
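As a rough illustration of the gating requested above, the sketch below skips the promotion pass entirely unless the flag is set. run_postprocessing and fake_promote are hypothetical names; fake_promote only stands in for promote_signature_widgets so the control flow can be shown without the detector.

```python
def run_postprocessing(results, use_signature_fields, promote):
    """Apply the promotion fallback only when signature fields were requested."""
    if not use_signature_fields:
        return results  # no extra pass, results are returned untouched
    return promote(results)


calls = []


def fake_promote(results):
    # Hypothetical stand-in for promote_signature_widgets; records each call.
    calls.append(results)
    return {**results, "promoted": True}


skipped = run_postprocessing({"widgets": []}, use_signature_fields=False, promote=fake_promote)
applied = run_postprocessing({"widgets": []}, use_signature_fields=True, promote=fake_promote)
```

Keeping the check at the call site means the default path pays no cost for the fallback, which matches the goal of only running it when it can change something.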
varun3011
left a comment
Pushed an update that addresses the review comments and the CI device
regression.
Changes:
- moved TextFragment into utils.py as a BaseModel
- attached text fragments to Page
- switched text extraction to reuse pypdfium2 during render_pdf()
- only run signature promotion when use_signature_fields=True
- made signature label terms configurable via signature_label_terms
- fixed FFDetr device handling so it stays on the requested device in CI/local runs
I reran uv run -m pytest tests/inference_test.py locally and all 7 tests
pass.

Fixes cases where --use-signature-fields does not produce /Sig fields even when the PDF contains a clear signature label and line.
What changed
prepare_form pypdf signature Signature when the detector missed the signature class entirely
Why
The detector can classify signature areas as regular text boxes. This change
adds a fallback that uses document text/layout to recover likely signature
fields without requiring model retraining.
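A minimal sketch of the kind of layout-based fallback described here, under stated assumptions: Box, Widget, is_near, promote_near_labels, the "Text"/"Signature" class names, and the 20-unit gap are all hypothetical and are not taken from the PR. The idea is to relabel a text box that sits close to a signature label, and only when the detector produced no signature widget at all.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (units are an assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Widget:
    kind: str  # hypothetical class names: "Text" or "Signature"
    box: Box


def is_near(label: Box, widget: Box, max_gap: float = 20.0) -> bool:
    """True if the widget overlaps the label box grown by max_gap on every side."""
    return not (
        widget.x1 < label.x0 - max_gap
        or widget.x0 > label.x1 + max_gap
        or widget.y1 < label.y0 - max_gap
        or widget.y0 > label.y1 + max_gap
    )


def promote_near_labels(widgets, label_boxes, max_gap=20.0):
    """Relabel text widgets near a signature label if no signature was detected."""
    if any(w.kind == "Signature" for w in widgets):
        return list(widgets)  # detector already found one, leave results alone
    return [
        replace(w, kind="Signature")
        if w.kind == "Text" and any(is_near(lb, w.box, max_gap) for lb in label_boxes)
        else w
        for w in widgets
    ]


label = Box(50, 700, 120, 712)
widgets = [
    Widget("Text", Box(125, 698, 300, 714)),  # directly beside the label
    Widget("Text", Box(50, 100, 300, 116)),   # elsewhere on the page
]
promoted = promote_near_labels(widgets, [label])
```

Only the widget beside the label is relabeled; the distant text box is left alone. A real implementation would need to choose the gap relative to page size and decide how to break ties between several nearby boxes.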
Notes
This is a post-processing fix, not a model-weight change. It depends on PDFs
having an extractable text layer near the signature label, so scanned/image-
only PDFs are still a limitation.