Summary
On Windows with a default CP936/GBK locale, a cached index can fail to load when metadata.json contains non-ASCII file paths encoded as UTF-8.
The first semble search run succeeds and writes the cache. The second run, which validates the cached index, crashes before returning results.
Environment
- OS: Windows
- Semble: 0.3.4
- Python default text encoding: cp936 / GBK
- PYTHONUTF8: unset
- PYTHONIOENCODING: unset
Reproduction
Using a clean temporary directory with a non-ASCII filename:
$env:PYTHONUTF8 = $null
$env:PYTHONIOENCODING = $null
$env:SEMBLE_CACHE_LOCATION = "<REPRO_DIR>\.semble-cache"
New-Item -ItemType Directory -Force "<REPRO_DIR>\docs"
Set-Content -Encoding utf8 "<REPRO_DIR>\docs\测试检查清单.md" "# 测试检查清单`n`nCache encoding repro."
uvx --from semble==0.3.4 semble search "测试检查" "<REPRO_DIR>" -k 3 --content docs
uvx --from semble==0.3.4 semble search "测试检查" "<REPRO_DIR>" -k 3 --content docs
The first run succeeds. The second run fails while validating the cache.
Observed behavior
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 224: illegal multibyte sequence
File "<USER_HOME>\...\Lib\site-packages\semble\index\index.py", line 149, in from_path
cache_path = get_validated_cache(str(path), model_path, normalized)
File "<USER_HOME>\...\Lib\site-packages\semble\cache.py", line 117, in get_validated_cache
with open(persistence_path.metadata) as f:
File "<USER_HOME>\...\Lib\site-packages\semble\cache.py", line 118, in get_validated_cache
metadata = json.load(f)
Expected behavior
The second run should load or validate the cached index without crashing.
Likely cause
SembleIndex.save() writes metadata.json as UTF-8 bytes via orjson.dumps(...), but get_validated_cache() reads it using open(persistence_path.metadata) without an explicit encoding. On Windows, that uses the active locale encoding (cp936 / GBK here), so UTF-8 bytes from non-ASCII file paths can fail to decode.
SembleIndex.load_from_disk() already reads the same metadata file in binary mode and parses it with orjson.loads(...), so cache validation could likely use the same approach.
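The decoding mismatch can be shown without Semble at all. The sketch below is only an illustration: try_decode is a hypothetical helper, and the "gbk" codec is named explicitly to stand in for what open() would pick on a cp936 system. It encodes the path from the reproduction as UTF-8 and then decodes the bytes both ways.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec rejected.
        return f"error: {exc.reason} at byte {exc.start} (0x{data[exc.start]:02x})"


# The same path as in the reproduction, encoded the way metadata.json stores it.
raw = "docs/测试检查清单.md".encode("utf-8")

print(try_decode(raw, "utf-8"))  # decodes cleanly
print(try_decode(raw, "gbk"))    # reports a decode error
```

Not every UTF-8 sequence is invalid GBK, so some non-ASCII paths may decode into wrong characters instead of raising. That is presumably why the failure depends on the specific file names in the index.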
Suggested fix
Either read the metadata as UTF-8 explicitly:
with open(persistence_path.metadata, encoding="utf-8") as f:
    metadata = json.load(f)
or keep it consistent with load_from_disk():
with open(persistence_path.metadata, "rb") as f:
    metadata = orjson.loads(f.read())
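Either option could be checked with a small round-trip test. The following is a minimal sketch, not Semble's code: read_metadata and roundtrip are hypothetical names, and the standard-library json module stands in for orjson so the example has no third-party dependency. It writes UTF-8 JSON bytes with a non-ASCII path and reads them back in binary mode, so the locale encoding is never consulted.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def read_metadata(path: Path) -> Any:
    """Hypothetical helper: parse a UTF-8 JSON file without locale-dependent decoding."""
    # json.loads accepts bytes and detects the UTF encoding itself,
    # so the text-mode default of open() plays no role here.
    return json.loads(path.read_bytes())


def roundtrip(payload: dict) -> Any:
    """Write payload as UTF-8 JSON bytes to a temp file, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        meta = Path(tmp) / "metadata.json"
        # Stand-in for orjson.dumps(...): UTF-8 bytes, non-ASCII left unescaped.
        meta.write_bytes(json.dumps(payload, ensure_ascii=False).encode("utf-8"))
        return read_metadata(meta)


sample = {"files": ["docs/测试检查清单.md"]}
assert roundtrip(sample) == sample
```

Routing both cache validation and index loading through one reader like this would keep the two code paths from drifting apart again.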
Workaround
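Setting PYTHONUTF8=1 before running semble avoids the crash, because Python's UTF-8 mode makes open() default to UTF-8 instead of the locale encoding.
Setting only PYTHONIOENCODING=utf-8 does not fix it, because that variable only controls the encoding of stdin, stdout, and stderr; it does not affect normal file open() calls.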
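To see which of these settings is actually in effect on a given machine, a short diagnostic can help. This is a sketch using only standard-library calls; text_encoding_report is a hypothetical name. It reports the UTF-8 mode flag, the encoding that text-mode open() falls back to, and the stdout encoding that PYTHONIOENCODING overrides.

```python
import locale
import sys


def text_encoding_report() -> dict:
    """Collect the settings that decide how text files and stdio are decoded."""
    return {
        # 1 when PYTHONUTF8=1 or -X utf8 is active, 0 otherwise.
        "utf8_mode": sys.flags.utf8_mode,
        # Encoding used by open() in text mode when no encoding= is given.
        "open_default": locale.getpreferredencoding(False),
        # Encoding of standard output; PYTHONIOENCODING changes this one only.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


print(text_encoding_report())
```

On the affected setup, the open_default entry would be expected to show cp936 until PYTHONUTF8=1 is set, whether or not PYTHONIOENCODING is present.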