Description
The in-memory LRU cache (_lru_get/_lru_set at lines 36–52) uses bare dict and list with zero synchronization. Four distinct race conditions:
- Double remove in _lru_set (line 46, _emb_lru_order.remove(key)): Two threads both see key in _emb_lru_store evaluate to True. Thread A calls remove(key) successfully. Thread B calls remove(key) on the already-removed key → ValueError: list.remove(x): x not in list.
- Eviction race (lines 48–49): Two threads both find len(_emb_lru_store) >= _EMBEDDING_LRU_MAX_SIZE. Thread A runs oldest = _emb_lru_order.pop(0); before it can del _emb_lru_store[oldest], Thread B also calls pop(0) and gets a different element, because index 0 has shifted. Eviction is corrupted: two entries are evicted where one was needed.
- Check-then-set non-atomicity (lines 46, 50): Two threads both check if key in _emb_lru_store, both find it absent, and both append to _emb_lru_order, leaving duplicate entries in the order list.
- Partial-write read in _lru_get (line 39): While one thread is mid-json.dumps(vector) (line 50), another calls _lru_get and reads partially-written JSON → json.JSONDecodeError.
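The first and third races can be replayed deterministically on a single thread by executing the steps in the order two threads could interleave them. This is only a sketch: the local store and order variables are hypothetical stand-ins for _emb_lru_store and _emb_lru_order, and the function names are invented for illustration, not taken from the module.

```python
def double_remove_race():
    """Replay the double-remove interleaving; returns the exception Thread B would hit."""
    store = {"k": "[0.1]"}   # stand-in for _emb_lru_store
    order = ["k"]            # stand-in for _emb_lru_order

    a_found = "k" in store   # Thread A: membership check passes
    b_found = "k" in store   # Thread B: check also passes before A removes
    assert a_found and b_found

    order.remove("k")        # Thread A removes the key from the order list
    try:
        order.remove("k")    # Thread B removes the same key again
    except ValueError as exc:
        return exc
    return None


def duplicate_append_race():
    """Replay the check-then-set interleaving; returns the resulting order list."""
    store, order = {}, []

    a_absent = "new" not in store   # Thread A: key is absent
    b_absent = "new" not in store   # Thread B: still absent, A has not written yet

    for absent in (a_absent, b_absent):
        if absent:
            order.append("new")     # each thread appends its own entry
            store["new"] = "[1.0]"
    return order


if __name__ == "__main__":
    print(repr(double_remove_race()))
    print(duplicate_append_race())
```

The second function shows why the order list can end up longer than the store: the dict write is idempotent, but the list append is not.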
Impact
- Intermittent 500 errors under concurrent load from ValueError propagating up.
- Silent data corruption: stale/partially-written cache entries → wrong embedding vectors → semantically incorrect RAG retrieval.
- Affects both document ingestion (embed_texts) and user queries (embed_query).
- Timing-dependent and non-deterministic — extremely hard to debug.
Fix Required (~25 lines in 1 file)
Add import threading and a module-level lock, then run both function bodies under it:

    import threading

    _emb_cache_lock = threading.Lock()

    def _lru_get(key):
        with _emb_cache_lock:
            ...

    def _lru_set(key, vector):
        with _emb_cache_lock:
            ...
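Or replace the dict/list pair with cachetools.LRUCache, which handles ordering and eviction internally. cachetools caches are not thread-safe on their own, so access still has to be guarded by a lock (for example, via the lock argument of cachetools.cached).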
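If adding a dependency is not desirable, the same single-structure idea can be sketched with the standard library. collections.OrderedDict keeps recency order and storage together, so there is no second list to fall out of sync. A lock still guards each compound operation. LockedLRU and its method names are hypothetical, not part of the existing module.

```python
import threading
from collections import OrderedDict


class LockedLRU:
    """Minimal LRU cache: one OrderedDict for data and order, one lock for access."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)   # mark as most recently used
            return self._data[key]

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)   # new or updated key becomes newest
            while len(self._data) > self._maxsize:
                self._data.popitem(last=False)  # evict least recently used

    def __len__(self):
        with self._lock:
            return len(self._data)
```

A short usage example:

```python
cache = LockedLRU(maxsize=2)
cache.set("a", [0.1])
cache.set("b", [0.2])
cache.get("a")          # "a" is now the most recently used key
cache.set("c", [0.3])   # evicts "b", the least recently used key
```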
GSSoC '26