char.fontname could be bytes #94

@qudansdl

Initial Checks

  • I confirm that I'm on the latest version

Description

I got the following error when parsing a PDF:

ERROR:root:1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode
Traceback (most recent call last):
  File "\backend\controller\notes.py", line 75, in create_note
    parsed_content = parser.parse(file_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\doc_parser.py", line 100, in parse
    text_elems = text.ingest(doc, parsing_method=text_engine)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\parse.py", line 19, in ingest
    return pdfminer.ingest(doc)
           ^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 235, in ingest
    lines.append(_create_line_element(text_line))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 157, in _create_line_element
    chars = _extract_chars(text_line)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 76, in _extract_chars
    CharElement(text=char.get_text(), fontname=last_fontname, size=last_size)
  File "\backend\.venv\Lib\site-packages\pydantic\main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode

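With some Korean PDFs, char.fontname is bytes rather than str when it is read at line 64 of text/pdfminer/core.py. The value is then passed to CharElement, whose fontname field only accepts a valid unicode string, so validation fails.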
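One possible workaround is to normalize the font name to str before it reaches CharElement. The sketch below is only an illustration: normalize_fontname is a hypothetical helper, not part of open-parse or pdfminer.six. The choice of cp949 as a fallback encoding is an assumption based on the bytes shown in the error, which look like a legacy Korean encoding rather than UTF-8.

```python
from typing import Tuple, Union


def normalize_fontname(
    fontname: Union[str, bytes],
    encodings: Tuple[str, ...] = ("utf-8", "cp949"),
) -> str:
    """Return fontname as str, decoding bytes with the first encoding that works.

    Hypothetical helper; the encoding list is an assumption, not something
    open-parse or pdfminer.six prescribes.
    """
    if isinstance(fontname, str):
        return fontname
    for encoding in encodings:
        try:
            return fontname.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never raises.
    return fontname.decode("latin-1")


# Example: a subset-prefixed font name whose suffix is cp949-encoded Korean text.
raw = b"ABCDEF+" + "한글".encode("cp949")
print(normalize_fontname(raw))
```

If something like this were applied where char.fontname is read in _extract_chars, the value passed to CharElement would always be a str. Whether the decoded name is meaningful still depends on how the PDF encoded it.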

Example Code

from openparse import processing, DocumentParser

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key="",
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
# file_path points to the Korean PDF that triggers the error
parsed_content = parser.parse(file_path)

Python, open-parse & OS Version

               python_version: 3.12.9
             operating_system: Windows
                   os_version: 11
           open-parse version: 0.7.0
                 install path: \backend\.venv\Lib\site-packages\openparse
               python version: 3.12.9 (tags/v3.12.9:fdb8142, Feb  4 2025, 15:27:58) [MSC v.1942 64 bit (AMD64)]
                     platform: Windows-11-10.0.26100-SP0
             related packages: pydantic-2.10.6 PyMuPDF-1.25.4

Labels: bug (Something isn't working)