feat(cache): replace zone io lock with striped io locks #72
Conversation
📝 Walkthrough
This change replaces the cache system's zone-level global I/O lock with a striped per-hash lock pool.
Estimated code review effort: 🎯 3 (moderate complexity) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Summary
This PR successfully replaces the zone-level global io_lock with a striped lock pool, significantly improving cache I/O concurrency. The implementation is well-designed and correctly addresses the architecture gap documented in CACHE_ARCHITECTURE_GAPS.md.
Key Improvements
- Reduced contention: Different hash files no longer share a single zone-level I/O bottleneck
- Fine-grained locking: Conflicts are now scoped to hash/stripe level rather than entire zone
- Race condition fix: Compare-and-remove pattern with hash reference checks prevents deletion of concurrently rewritten files
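The compare-and-remove pattern can be sketched with a minimal std-only example. The map type, key, and `u64` snapshot value here stand in for the real index entry; the names are hypothetical, not the crate's API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Minimal compare-and-remove sketch: the entry is removed only when the
/// index still holds the exact value the caller observed earlier, so an
/// entry replaced by a concurrent rewrite is left untouched.
fn remove_if_matches(index: &Mutex<HashMap<String, u64>>, key: &str, expected: u64) -> bool {
    let mut map = index.lock().unwrap();
    match map.get(key) {
        Some(current) if *current == expected => {
            map.remove(key);
            true
        }
        _ => false,
    }
}
```

A caller holding a stale snapshot gets `false` and leaves the rewritten entry in place, which is the property the PR's deletion paths rely on.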
Implementation Quality
- Clean separation of concerns with a dedicated `io.rs` module
- Proper use of BTreeSet to acquire locks in deterministic order (prevents deadlocks)
- Comprehensive test coverage including striping validation and race condition scenarios
- Documentation updated to mark this gap as completed
The changes are ready to merge.
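The deadlock-avoidance property of acquiring locks through a BTreeSet can be illustrated with a std-only sketch (stripe count and function names are illustrative, not the crate's actual API): sorting and deduplicating stripe indices before locking gives every task the same global acquisition order.

```rust
use std::collections::BTreeSet;
use std::sync::{Mutex, MutexGuard};

const STRIPES: usize = 8;

/// Lock the stripes for a set of hashes in sorted, deduplicated order.
/// Because every caller acquires stripes in ascending index order, two
/// tasks locking overlapping stripe sets can never deadlock on each other.
fn lock_stripes<'a>(
    pool: &'a [Mutex<()>; STRIPES],
    stripe_ids: &[usize],
) -> Vec<MutexGuard<'a, ()>> {
    let ordered: BTreeSet<usize> = stripe_ids.iter().copied().collect();
    ordered.into_iter().map(|i| pool[i].lock().unwrap()).collect()
}
```

Deduplication also matters: without it, two hashes landing on the same stripe would try to lock one mutex twice and self-deadlock.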
Code Review
This pull request replaces the global cache I/O lock with a fine-grained, striped RwLock pool to improve concurrency and throughput. It also introduces a compare-and-remove mechanism for cache entries to resolve race conditions between entry removal and concurrent writes. The review feedback highlights critical performance concerns regarding O(N) index scans during file cleanup and potential logic errors in entry comparisons caused by volatile timestamp fields in the CacheIndexEntry struct.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/rginx-http/src/cache/store/revalidate.rs (1)
86-103: ⚠️ Potential issue | 🔴 Critical: deleting files after releasing `io_write` creates a TOCTOU race. The write lock taken on lines 86-96 is released before the index removal and `remove_file` execute. In that window a concurrent write can reuse the same hash file, which this branch then deletes by mistake. Suggested fix (pull the index removal and file deletion into the same `io_write` scope):

```diff
-let refreshed = {
-    let _io_guard = context.zone.io_write(&cached_entry.hash).await;
-    build_cached_response_for_request(
-        &paths.body,
-        &metadata,
-        &context.request,
-        &context.policy,
-        context.read_cached_body,
-    )
-    .await
-    .map_err(|error| CacheStoreError { source: Box::new(error) })?
-};
-if let Some(removed) =
-    remove_zone_index_entry_if_matches(&context.zone, &context.key, &cached_entry).await
-    && removed.delete_files
-{
-    let _ = fs::remove_file(&paths.metadata).await;
-    let _ = fs::remove_file(&paths.body).await;
-}
+let refreshed = {
+    let _io_guard = context.zone.io_write(&cached_entry.hash).await;
+    let refreshed = build_cached_response_for_request(
+        &paths.body,
+        &metadata,
+        &context.request,
+        &context.policy,
+        context.read_cached_body,
+    )
+    .await
+    .map_err(|error| CacheStoreError { source: Box::new(error) })?;
+
+    if let Some(removed) = remove_zone_index_entry_if_matches(
+        &context.zone,
+        &context.key,
+        &cached_entry,
+    )
+    .await
+        && removed.delete_files
+    {
+        let _ = fs::remove_file(&paths.metadata).await;
+        let _ = fs::remove_file(&paths.body).await;
+    }
+    refreshed
+};
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/rginx-http/src/cache/store/revalidate.rs` around lines 86 - 103, The io_write guard is being dropped before you call remove_zone_index_entry_if_matches and remove_file, creating a TOCTOU where another writer can reuse the same hash and cause accidental deletion; to fix, acquire and hold the write lock returned by context.zone.io_write(&cached_entry.hash).await across the index removal and file deletions (i.e., keep the _io_guard live until after you call remove_zone_index_entry_if_matches(&context.zone, &context.key, &cached_entry).await and, if removed.delete_files, perform fs::remove_file(&paths.metadata).await and fs::remove_file(&paths.body).await while still holding the guard), so the index mutation and physical file removals occur inside the same io_write critical section that protects cached_entry.hash.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/rginx-http/src/cache/runtime/support.rs`:
- Around line 81-85: The remove_cache_files function currently swallows errors
from fs::remove_file for paths.metadata and paths.body; update
remove_cache_files (and keep using cache_paths_for_zone and the zone:
&rginx_core::CacheZone, hash: &str parameters) to check the Result of each
fs::remove_file and log a warning or debug entry on failure that includes the
hash, the path (paths.metadata / paths.body), and the error (e.g., via
tracing::warn! or log::warn!), so failures due to permissions or filesystem
issues are observable.
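A minimal sketch of the suggested change, using `eprintln!` in place of `tracing::warn!` so the example stays dependency-free (the function name and signature are illustrative):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove one cache file, surfacing failures instead of discarding them.
/// The real code would emit tracing::warn! with hash, path, and error fields.
fn remove_cache_file(hash: &str, path: &Path) -> io::Result<()> {
    fs::remove_file(path).map_err(|error| {
        eprintln!(
            "cache file removal failed: hash={hash} path={} error={error}",
            path.display()
        );
        error
    })
}
```

Returning the `Result` additionally lets callers decide whether a failed removal should abort the maintenance pass or merely be recorded.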
In `@crates/rginx-http/src/cache/store/maintenance/index_state.rs`:
- Around line 64-67: The current equality check compares the whole entry struct
(used in remove_cache_entry_if_matches / index.entries.get) which fails when
volatile fields like last_access_unix_ms change; update the comparison to only
compare stable identity fields (e.g., entry.hash and entry.base_key) or use a
dedicated version/id field if available, and leave timestamp fields out of the
equality test so concurrent updates to last_access_unix_ms won't prevent a valid
match and removal.
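The fix can be sketched as a dedicated identity check that ignores volatile fields. The field names follow the ones mentioned above; the struct shape itself is an assumption for illustration:

```rust
#[derive(Debug, Clone)]
struct CacheIndexEntry {
    hash: String,
    base_key: String,
    last_access_unix_ms: u64, // volatile: bumped on every access
}

/// Compare only stable identity fields, so a concurrent bump of
/// last_access_unix_ms cannot defeat compare-and-remove.
fn same_identity(a: &CacheIndexEntry, b: &CacheIndexEntry) -> bool {
    a.hash == b.hash && a.base_key == b.base_key
}
```

Using a helper rather than derived `PartialEq` keeps the derived whole-struct equality available for tests while the removal path matches on identity only.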
In `@crates/rginx-http/src/cache/tests/storage_regressions.rs`:
- Around line 236-240: io_stripe is tightly coupled to an implementation detail
(hardcoded "% 64" and DefaultHasher) which may diverge from the real sharding in
CacheIoLockPool; change the test to reuse the production shard logic or
configuration instead of duplicating it: call the public shard function or read
the shard count/stripe function exported by the cache/io lock pool (e.g., use
CacheIoLockPool's stripe function or its IO lock shard constant) so the test
uses the same hashing+modulo behavior as the production code and remove the
hardcoded "% 64" and standalone DefaultHasher usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5de61607-926c-4c71-89aa-37f3dc739dda
📒 Files selected for processing (14)
- crates/rginx-http/src/cache/io.rs
- crates/rginx-http/src/cache/lookup.rs
- crates/rginx-http/src/cache/manager.rs
- crates/rginx-http/src/cache/mod.rs
- crates/rginx-http/src/cache/runtime.rs
- crates/rginx-http/src/cache/runtime/support.rs
- crates/rginx-http/src/cache/store.rs
- crates/rginx-http/src/cache/store/maintenance/index_state.rs
- crates/rginx-http/src/cache/store/maintenance/mod.rs
- crates/rginx-http/src/cache/store/revalidate.rs
- crates/rginx-http/src/cache/tests/mod.rs
- crates/rginx-http/src/cache/tests/storage_p1.rs
- crates/rginx-http/src/cache/tests/storage_regressions.rs
- docs/CACHE_ARCHITECTURE_GAPS.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Test Fast
🔇 Additional comments (13)
docs/CACHE_ARCHITECTURE_GAPS.md (1)
19-33: The split between completed items and long-term gaps is clean. The completed-item notes, date annotations, and renumbering of the later sections are consistent, which keeps the document easy to track and maintain.
Also applies to: 49-50, 62-63, 76-77
crates/rginx-http/src/cache/tests/storage_p1.rs (1)
248-248: Test runtime lock model updated correctly. Line 248 switches the test construction over to CacheIoLockPool, keeping it aligned with the production path's concurrency model.
crates/rginx-http/src/cache/tests/mod.rs (1)
49-49: Test helper construction aligned with the new concurrency semantics. The replacement on line 49 is a necessary synchronized update that stops the tests from depending on the old zone-level lock.
crates/rginx-http/src/cache/store.rs (1)
34-37: Both the write-path concurrency control and the maintenance API migration are reasonable. The hash-level write lock on line 119 and the re-exported match-based removal interface on line 36 converge on the same semantics; the design is consistent.
Also applies to: 119-119
crates/rginx-http/src/cache/manager.rs (1)
33-33: The manager's hit-read path and failure-cleanup path converge well. Combining the hash-level read lock on line 93 with the conditional removal on line 119 better avoids unnecessary serialization across keys.
Also applies to: 93-93, 119-119
crates/rginx-http/src/cache/io.rs (1)
11-65: The striped I/O lock pool is well designed. Deduplicating by stripe and acquiring locks through an ordered set guarantees a deterministic acquisition order when taking write locks for multiple hashes.
crates/rginx-http/src/cache/lookup.rs (1)
109-109: The lookup read path and the error-cleanup path are unified and correct. The read locks introduced on lines 109 and 165, together with the unified conditional removal after failures, significantly shrink the accidental-deletion window.
Also applies to: 122-123, 134-135, 146-147, 164-218
crates/rginx-http/src/cache/store/revalidate.rs (1)
107-141: The ordering of writing metadata back, then updating the index and returning the response, is correct. Lines 107-141 perform the write-back, statistics, and index update within a single write-lock scope, so the sequencing is robust.
crates/rginx-http/src/cache/runtime.rs (1)
108-109: The fine-grained I/O lock and conditional-removal paths are implemented correctly. Switching line 108 to a hash-level read lock and using compare-and-remove on line 135 avoids zone-level serialization and accidental deletion of already-replaced entries, matching this change's concurrency goals.
Also applies to: 135-136
crates/rginx-http/src/cache/tests/storage_regressions.rs (1)
145-234: The concurrency regression scenarios are well covered. This case exercises the critical race window (removing a stale entry while the write lock for the same hash is held must not delete freshly rewritten content) and effectively guards this PR's core concurrency semantics.
crates/rginx-http/src/cache/store/maintenance/mod.rs (1)
70-76: The hash-deduplicated, delete-only-when-unreferenced strategy is sound. Changing file deletion from "delete directly" to "deduplicate by hash, then delete based on references", routed through remove_cache_files_if_unreferenced, significantly lowers the risk of concurrently deleting files shared by the same hash.
Also applies to: 124-130, 145-146, 219-220
crates/rginx-http/src/cache/runtime/support.rs (1)
7-35: The execution order of conditional removal followed by "delete files only when unreferenced" is correct. Taking the hash write lock first, then doing the compare-and-remove, then deciding deletion via delete_files keeps this path safe and predictable under concurrency.
crates/rginx-http/src/cache/mod.rs (1)
19-19: The type and runtime structure changes point in the right direction. io_locks: CacheIoLockPool, together with the PartialEq/Eq additions on CacheIndexEntry and CachedVaryHeaderValue, provide the groundwork for the new lock model and the conditional-removal API; the design is consistent and necessary.
Also applies to: 36-36, 46-47, 155-155, 174-175, 211-212
Pull request overview
This PR replaces a cache-zone global I/O mutex with hash-striped RwLock coordination to improve concurrency, and hardens cache entry deletion to avoid removing newly rewritten files.
Changes:
- Introduces `CacheIoLockPool` (64 stripes) and migrates cache read/write/revalidate/maintenance paths to `io_read`/`io_write`.
- Updates purge/cleanup/eviction deletion to only remove cache files when the hash is unreferenced.
- Adds regression tests for cross-stripe non-blocking behavior and stale-removal vs. in-flight rewrite races.
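The striped pool described above can be sketched std-only (64 stripes as in the PR; the hashing choice, type shape, and names are assumptions, and the real `CacheIoLockPool` may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};

const IO_LOCK_STRIPES: usize = 64;

/// Striped I/O lock pool: every cache hash maps onto one of 64 RwLocks,
/// so operations on different hashes rarely contend, while two operations
/// on the same hash always meet on the same lock.
struct IoLockPool {
    stripes: Vec<Arc<RwLock<()>>>,
}

impl IoLockPool {
    fn new() -> Self {
        Self {
            stripes: (0..IO_LOCK_STRIPES)
                .map(|_| Arc::new(RwLock::new(())))
                .collect(),
        }
    }

    /// Deterministic hash-to-stripe mapping (DefaultHasher::new() uses
    /// fixed keys, so the same input always yields the same stripe).
    fn stripe_index(hash: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        hash.hash(&mut hasher);
        (hasher.finish() as usize) % IO_LOCK_STRIPES
    }

    /// Clone out the stripe's lock so the guard can outlive the &self borrow.
    fn lock_for(&self, hash: &str) -> Arc<RwLock<()>> {
        Arc::clone(&self.stripes[Self::stripe_index(hash)])
    }
}
```

A caller would wrap file I/O in `pool.lock_for(hash).read()` or `.write()`; reads on different hashes proceed in parallel unless they collide on a stripe.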
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/CACHE_ARCHITECTURE_GAPS.md | Marks the zone-global io_lock gap as completed and documents the new striped locking model. |
| crates/rginx-http/src/cache/io.rs | Adds striped per-hash RwLock pool and CacheZoneRuntime::io_read/io_write helpers. |
| crates/rginx-http/src/cache/mod.rs | Wires in the new I/O lock pool and adds PartialEq/Eq to index entry structs to support compare-and-remove. |
| crates/rginx-http/src/cache/store.rs | Uses io_write(hash) for store path and keeps coordination aligned with the new lock model. |
| crates/rginx-http/src/cache/store/revalidate.rs | Switches revalidation to hash-scoped I/O locks and compare-and-remove index deletion. |
| crates/rginx-http/src/cache/store/maintenance/mod.rs | Updates cleanup/purge/eviction to delete files via “unreferenced hash” checks instead of a global I/O lock. |
| crates/rginx-http/src/cache/store/maintenance/index_state.rs | Introduces remove_zone_index_entry_if_matches returning whether files should be deleted. |
| crates/rginx-http/src/cache/runtime/support.rs | Adds remove_cache_entry_if_matches and remove_cache_files_if_unreferenced helpers. |
| crates/rginx-http/src/cache/runtime.rs | Routes stale/invalid entry cleanup through the new compare-and-remove deletion helper. |
| crates/rginx-http/src/cache/manager.rs | Updates hit path to use io_read and cleanup via compare-and-remove removal helper. |
| crates/rginx-http/src/cache/lookup.rs | Updates metadata/body reads to io_read and cleanup via compare-and-remove removal helper. |
| crates/rginx-http/src/cache/tests/mod.rs | Updates test zone construction to use CacheIoLockPool. |
| crates/rginx-http/src/cache/tests/storage_p1.rs | Updates test runtime construction to use CacheIoLockPool. |
| crates/rginx-http/src/cache/tests/storage_regressions.rs | Adds tests for stripe isolation and stale removal vs rewrite safety. |
I've addressed the concurrency and implementation issues raised in this review round and pushed the fixes. This round of fixes includes:
Re-verified:
Summary
- Replace the zone-global `io_lock` with striped per-hash `RwLock` I/O coordination

Testing