char.fontname could be bytes

### Initial Checks

- [x] I confirm that I'm on the latest version

### Description

I got the following error:

```Text
ERROR:root:1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode
Traceback (most recent call last):
  File "\backend\controller\notes.py", line 75, in create_note
    parsed_content = parser.parse(file_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\doc_parser.py", line 100, in parse
    text_elems = text.ingest(doc, parsing_method=text_engine)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\parse.py", line 19, in ingest
    return pdfminer.ingest(doc)
           ^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 235, in ingest
    lines.append(_create_line_element(text_line))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 157, in _create_line_element
    chars = _extract_chars(text_line)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 76, in _extract_chars
    CharElement(text=char.get_text(), fontname=last_fontname, size=last_size)
  File "\backend\.venv\Lib\site-packages\pydantic\main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode
```

With some Korean PDFs, `char.fontname` is `bytes` rather than `str` at line 64 of `text/pdfminer/core.py`. The bytes are not valid UTF-8, so pydantic rejects them when `CharElement` is constructed.
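
A possible workaround, as a minimal sketch only: coerce the font name to `str` before it reaches `CharElement`. The helper name `normalize_fontname` is hypothetical and not part of open-parse or pdfminer, and the choice of `cp949` as a fallback encoding is my assumption based on the byte values in the error above (they look like a legacy Korean encoding). Other PDFs may need different encodings.

```python
from typing import Union


def normalize_fontname(fontname: Union[str, bytes]) -> str:
    """Return the font name as str, decoding bytes if necessary.

    Hypothetical helper: the encoding order below is an assumption,
    not something open-parse or pdfminer prescribes.
    """
    if isinstance(fontname, str):
        return fontname
    # Try UTF-8 first, then the legacy Korean code page.
    for encoding in ("utf-8", "cp949"):
        try:
            return fontname.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises.
    return fontname.decode("latin-1")


if __name__ == "__main__":
    raw = "ELPAGO+*한글".encode("cp949")  # stand-in for a bytes font name
    print(normalize_fontname(raw))
```

Something like `last_fontname = normalize_fontname(char.fontname)` in `_extract_chars` would then keep validation from failing, though the proper fix may belong elsewhere.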


### Example Code

```Python
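from openparse import DocumentParser, processing

# Semantic chunking pipeline; fill in a real OpenAI API key
semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key="",
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
# file_path points to the Korean PDF that triggers the error
parsed_content = parser.parse(file_path)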
```

### Python, open-parse & OS Version

```Text
python_version: 3.12.9
             operating_system: Windows
                   os_version: 11
           open-parse version: 0.7.0
                 install path: \backend\.venv\Lib\site-packages\openparse
               python version: 3.12.9 (tags/v3.12.9:fdb8142, Feb  4 2025, 15:27:58) [MSC v.1942 64 bit (AMD64)]
                     platform: Windows-11-10.0.26100-SP0
             related packages: pydantic-2.10.6 PyMuPDF-1.25.4
```
