[BUG] Thread-Unsafe LRU Embedding Cache Causes Data Corruption Under Load #562

@ionfwsrijan

Description

The in-memory LRU cache (_lru_get/_lru_set at lines 36–52) uses a bare dict and list with no synchronization. This allows four distinct race conditions:

  1. _lru_set line 46 — _emb_lru_order.remove(key): Two threads both find key in _emb_lru_store is True. Thread A calls remove(key) successfully. Thread B calls remove(key) on the already-removed key → ValueError: list.remove(x): x not in list.

  2. Eviction race lines 48–49: Two threads both find len(_emb_lru_store) >= _EMBEDDING_LRU_MAX_SIZE. Thread A runs oldest = _emb_lru_order.pop(0); before it can del _emb_lru_store[oldest], Thread B also runs pop(0), which now returns a different key. Two entries are evicted where only one was needed.

  3. Check-then-set non-atomicity (lines 46, 50): Two threads both check if key in _emb_lru_store, both find it absent, both append to _emb_lru_order — duplicate entries in the order list.

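  4. Partial-write read in _lru_get (line 39): While one thread is inside _lru_set (line 50), another can call _lru_get and read the cache mid-update. The reported symptom is json.JSONDecodeError, though this needs a reproduction: json.dumps(vector) returns a complete string before it is stored, so in CPython a reader should not see truncated JSON from that step alone.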
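
The first interleaving can be replayed deterministically in a single thread, which is easier to reason about than a flaky stress test. This is only a sketch: `store` and `order` are stand-ins for _emb_lru_store and _emb_lru_order, and the steps are assumed from the description above, not copied from the module.

```python
# Stand-ins for _emb_lru_store and _emb_lru_order (assumed shapes).
store = {"k": "[0.1, 0.2]"}
order = ["k"]

# Step 1: both threads run the membership check before either mutates.
thread_a_saw_key = "k" in store
thread_b_saw_key = "k" in store

# Step 2: thread A removes the key from the order list.
if thread_a_saw_key:
    order.remove("k")

# Step 3: thread B acts on its now-stale check and removes again.
raised = False
if thread_b_saw_key:
    try:
        order.remove("k")
    except ValueError:
        raised = True  # the failure described in race 1

print(raised)
```

The check and the mutation are separate steps, so any fix has to make the pair atomic rather than guard only the remove call.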

Impact

  • Intermittent 500 errors under concurrent load from ValueError propagating up.
  • Silent data corruption: stale/partially-written cache entries → wrong embedding vectors → semantically incorrect RAG retrieval.
  • Affects both document ingestion (embed_texts) and user queries (embed_query).
  • Timing-dependent and non-deterministic — extremely hard to debug.

Fix Required (~25 lines in 1 file)

Add import threading and a module-level lock:

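import threading

# One module-level lock guards both _emb_lru_store and _emb_lru_order.
_emb_cache_lock = threading.Lock()

def _lru_get(key):
    with _emb_cache_lock:
        ...  # existing lookup body, unchanged

def _lru_set(key, vector):
    with _emb_cache_lock:
        ...  # existing insert/evict body, unchanged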

Or replace with cachetools.LRUCache, which handles ordering internally. Note that cachetools caches are not thread-safe on their own, so access still has to be wrapped in a lock.

GSSoC '26

  • Yes, I am participating in GirlScript Summer of Code and would like to fix this.

Labels: bug (Something isn't working), gssoc (GirlScript Summer of Code 2026 issue/PR)
