Initial Checks
- I confirm that I'm on the latest version
Description
I got the following error:
ERROR:root:1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode
Traceback (most recent call last):
  File "\backend\controller\notes.py", line 75, in create_note
    parsed_content = parser.parse(file_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\doc_parser.py", line 100, in parse
    text_elems = text.ingest(doc, parsing_method=text_engine)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\parse.py", line 19, in ingest
    return pdfminer.ingest(doc)
           ^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 235, in ingest
    lines.append(_create_line_element(text_line))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 157, in _create_line_element
    chars = _extract_chars(text_line)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\backend\.venv\Lib\site-packages\openparse\text\pdfminer\core.py", line 76, in _extract_chars
    CharElement(text=char.get_text(), fontname=last_fontname, size=last_size)
  File "\backend\.venv\Lib\site-packages\pydantic\main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for CharElement
fontname
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'ELPAGO+*\xc7\xd1\xbe\xe...xb0\xdf\xb0\xed\xb5\xf1', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.10/v/string_unicode
With some Korean PDFs, char.fontname is bytes rather than str at line 64 of the text/pdfminer/core.py file, so the value later fails validation as the fontname field of CharElement.
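One possible workaround would be to decode a bytes font name before it reaches CharElement. The following is only a minimal sketch, not the maintainers' fix: `normalize_fontname` is a hypothetical helper that does not exist in open-parse, and the choice of cp949 as a fallback encoding is an assumption based on the bytes shown in the error (they look like EUC-KR/CP949 Korean text).

```python
from typing import Optional, Sequence, Union


def normalize_fontname(
    value: Union[str, bytes, None],
    encodings: Sequence[str] = ("utf-8", "cp949"),  # cp949 is an assumed fallback for Korean PDFs
) -> str:
    """Return a str font name, decoding bytes if necessary (hypothetical helper)."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    # Try each candidate encoding in order and keep the first that decodes cleanly.
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never raises.
    return value.decode("latin-1")
```

If something like this were applied where the font name is read in `_extract_chars`, the value passed as `fontname` would always be a str. Whether the decoded name is the correct one depends on the encoding the PDF actually uses, which this sketch can only guess.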
Example Code
from openparse import processing, DocumentParser

semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key="",
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
# file_path is the path to the PDF being parsed (a Korean PDF in this report)
parsed_content = parser.parse(file_path)

Python, open-parse & OS Version
python_version: 3.12.9
operating_system: Windows
os_version: 11
open-parse version: 0.7.0
install path: \backend\.venv\Lib\site-packages\openparse
python version: 3.12.9 (tags/v3.12.9:fdb8142, Feb 4 2025, 15:27:58) [MSC v.1942 64 bit (AMD64)]
platform: Windows-11-10.0.26100-SP0
related packages: pydantic-2.10.6 PyMuPDF-1.25.4