fix: correctly read cache metadata as UTF-8 #203

Merged
stephantul merged 1 commit into MinishLab:main from Pililink:fix/windows-cache-metadata-encoding on Jun 18, 2026

Conversation

@Pililink

Contributor

Summary

  • read cached metadata.json with explicit UTF-8 encoding during cache validation
  • add a regression test for non-ASCII cached file paths under a simulated CP936 default encoding

Root cause

SembleIndex.save() writes cache metadata as UTF-8 JSON bytes, but get_validated_cache() read the same file using the platform default text encoding. On Windows systems using CP936/GBK, UTF-8 metadata containing non-ASCII file paths could fail during cache validation before the cached index was loaded.
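
The mismatch described above can be reproduced outside of semble with a minimal sketch. The metadata payload and file layout here are hypothetical stand-ins (the real schema is not shown in this PR), and the CP936 default is simulated by passing `encoding="cp936"` explicitly rather than relying on the platform locale:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical payload: the real metadata schema used by semble is not shown in this PR.
metadata = {"files": ["docs/测试检查清单.md"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    # Writer side: UTF-8 JSON bytes, with non-ASCII characters left unescaped.
    path.write_bytes(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))

    # Reader side before the fix, simulated: decode with CP936 instead of UTF-8.
    # Depending on the bytes involved this either raises or yields mojibake,
    # so both outcomes are treated as "not a faithful read".
    try:
        with open(path, encoding="cp936") as f:
            cp936_roundtrip_ok = json.load(f) == metadata
    except ValueError:  # UnicodeDecodeError and JSONDecodeError both derive from ValueError
        cp936_roundtrip_ok = False

    # Reader side after the fix: explicit UTF-8, independent of the platform default.
    with open(path, encoding="utf-8") as f:
        utf8_roundtrip_ok = json.load(f) == metadata

print("cp936 read faithful:", cp936_roundtrip_ok)
print("utf-8 read faithful:", utf8_roundtrip_ok)
```

Pinning the encoding on the reader keeps it symmetric with the writer, so the result no longer depends on `PYTHONUTF8` or the system code page.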

Fixes #202

Validation

  • uv run --extra dev pytest tests/test_cache.py
  • uv run --extra dev ruff check src/semble/cache.py tests/test_cache.py
  • uv run --extra dev --extra mcp pytest -k "not test_walk_files_skips_symlinks"
  • Manual CLI repro: two consecutive searches against a temp repo containing docs\\测试检查清单.md with PYTHONUTF8 and PYTHONIOENCODING unset

Note: full uv run --extra dev --extra mcp pytest reaches 271 passed and then fails only on tests/test_file_walker.py::test_walk_files_skips_symlinks because this Windows environment lacks permission to create symlinks (WinError 1314).

@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Confidence Score: 5/5

Safe to merge — one-line change with a targeted regression test that correctly exercises the exact failure mode.

The change is minimal: a single encoding="utf-8" argument added to one open() call. The rest of the codebase already uses binary mode for persistence I/O, so there are no other text-encoding gaps to worry about. The regression test is well-constructed — it writes UTF-8 metadata, patches builtins.open to inject CP936 only when no encoding is supplied, and uses a git URL to skip file-walking, exercising exactly the path that was broken.

No files require special attention.

Reviews (1): Last reviewed commit: "修复 Windows 缓存元数据编码读取" ("Fix Windows cache metadata encoding read")

@stephantul changed the title from "Read cache metadata as UTF-8 on Windows" to "fix: correctly read cache metadata as UTF-8" on Jun 18, 2026
@stephantul merged commit 5c75f9e into MinishLab:main on Jun 18, 2026
15 checks passed
@codecov

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines | Coverage Δ
src/semble/cache.py | 100.00% <100.00%> (ø)

Development

Successfully merging this pull request may close these issues.

Windows cache validation fails with UnicodeDecodeError for UTF-8 metadata containing non-ASCII file paths

2 participants