Skip to content

Windows cache validation fails with UnicodeDecodeError for UTF-8 metadata containing non-ASCII file paths #202

@Pililink

Description

@Pililink

Summary

On Windows with a default CP936/GBK locale, a cached index can fail to load when metadata.json contains UTF-8 non-ASCII file paths.

The first semble search run succeeds and writes the cache. The second run, which validates the cached index, crashes before returning results.

Environment

  • OS: Windows
  • Semble: 0.3.4
  • Python default text encoding: cp936 / GBK
  • PYTHONUTF8 unset
  • PYTHONIOENCODING unset

Reproduction

Using a clean temporary directory with a non-ASCII filename:

$env:PYTHONUTF8 = $null
$env:PYTHONIOENCODING = $null
$env:SEMBLE_CACHE_LOCATION = "<REPRO_DIR>\.semble-cache"

New-Item -ItemType Directory -Force "<REPRO_DIR>\docs"
Set-Content -Encoding utf8 "<REPRO_DIR>\docs\测试检查清单.md" "# 测试检查清单`n`nCache encoding repro."

uvx --from semble==0.3.4 semble search "测试检查" "<REPRO_DIR>" -k 3 --content docs
uvx --from semble==0.3.4 semble search "测试检查" "<REPRO_DIR>" -k 3 --content docs

The first run succeeds. The second run fails while validating the cache.

Observed behavior

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 224: illegal multibyte sequence

  File "<USER_HOME>\...\Lib\site-packages\semble\index\index.py", line 149, in from_path
    cache_path = get_validated_cache(str(path), model_path, normalized)
  File "<USER_HOME>\...\Lib\site-packages\semble\cache.py", line 117, in get_validated_cache
    with open(persistence_path.metadata) as f:
  File "<USER_HOME>\...\Lib\site-packages\semble\cache.py", line 118, in get_validated_cache
    metadata = json.load(f)

Expected behavior

The second run should load or validate the cached index without crashing.

Likely cause

SembleIndex.save() writes metadata.json as UTF-8 bytes via orjson.dumps(...), but get_validated_cache() reads it using open(persistence_path.metadata) without an explicit encoding. On Windows, that uses the active locale encoding (cp936 / GBK here), so UTF-8 bytes from non-ASCII file paths can fail to decode.

SembleIndex.load_from_disk() already reads the same metadata file in binary mode and parses it with orjson.loads(...), so cache validation could likely use the same approach.

Suggested fix

Either read the metadata as UTF-8 explicitly:

with open(persistence_path.metadata, encoding="utf-8") as f:
    metadata = json.load(f)

or keep it consistent with load_from_disk():

with open(persistence_path.metadata, "rb") as f:
    metadata = orjson.loads(f.read())

Workaround

Setting PYTHONUTF8=1 avoids the crash:

$env:PYTHONUTF8 = "1"

Setting only PYTHONIOENCODING=utf-8 does not fix it, because it does not affect normal file open() calls.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions