fix(bridge): split NotificationCache out of Translator lock to eliminate notification_pump starvation #104

@bug-ops

Problem

notification_pump and MCP request handlers share a single Arc<Mutex<Translator>>. Under a slow or busy LSP server this produces head-of-line blocking that behaves like a deadlock until the request times out:

  1. get_diagnostics acquires the lock and calls pull_diagnostics — a textDocument/diagnostic request with a 30 s timeout.
  2. While the pull is in-flight, the LSP server publishes publishDiagnostics on stdout. notification_pump tries to acquire the same lock and stalls.
  3. The bounded notification channel fills. The bridge's stdout reader back-pressures and stops reading.
  4. The textDocument/diagnostic response the lock-holder is waiting for can no longer arrive. The system stalls until the 30 s timeout fires.
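
The four steps above can be reproduced in miniature. The following is a minimal sketch, not the bridge's code: it uses std threads and std::sync primitives instead of the async ones in lib.rs, and the names SharedTranslator and request_stalls, the channel capacity of 1, and the count of four notifications are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, sync_channel, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for Translator, which owns the notification cache.
#[derive(Default)]
struct SharedTranslator {
    notification_cache: HashMap<String, String>,
}

/// Returns true if the simulated request times out because the pump is starved.
fn request_stalls(timeout: Duration) -> bool {
    let translator = Arc::new(Mutex::new(SharedTranslator::default()));
    // Bounded notification channel (capacity 1 is an arbitrary small bound).
    let (note_tx, note_rx) = sync_channel::<String>(1);
    // Path on which the response reaches the request handler.
    let (resp_tx, resp_rx) = channel::<&'static str>();

    // Step 1: the request handler takes the lock before waiting on the server.
    let guard = translator.lock().unwrap();

    // The pump needs the same lock for every notification it drains.
    let pump = {
        let translator = Arc::clone(&translator);
        thread::spawn(move || {
            while let Ok(uri) = note_rx.recv() {
                // Step 2: blocks here while the handler holds the lock.
                let mut t = translator.lock().unwrap();
                t.notification_cache.insert(uri, "diagnostic".to_string());
            }
        })
    };

    // The reader sees one ordered stream: notifications first, response last.
    let reader = thread::spawn(move || {
        for i in 0..4 {
            // Step 3: blocks once the bounded channel is full.
            note_tx.send(format!("file:///{}.rs", i)).unwrap();
        }
        let _ = resp_tx.send("diagnostic response");
    });

    // Step 4: the response cannot arrive, so only the timeout ends the wait.
    let timed_out = matches!(
        resp_rx.recv_timeout(timeout),
        Err(RecvTimeoutError::Timeout)
    );

    // Releasing the lock lets the pump drain and both threads finish.
    drop(guard);
    reader.join().unwrap();
    pump.join().unwrap();
    timed_out
}
```

Calling request_stalls(Duration::from_millis(100)) returns true in this model: the reader is stuck behind a full channel, so the response is never delivered while the lock is held.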

The drop(guard) before rx.recv() at lib.rs:123 shows the intent to release the lock between messages, but the pump re-acquires it immediately on the next iteration, so under sustained push traffic the lock is never free for long.

Introduced implicitly by #103, which wired publishDiagnostics into the pump for the first time and made the contention window reachable in normal operation.

Affected code

  • crates/mcpls-core/src/lib.rs: notification_pump (line 99) and the Mutex<Translator> it shares with serve
  • crates/mcpls-core/src/bridge/translator.rs: notification_cache_mut accessor used by the pump

Fix

Extract NotificationCache into its own Arc<Mutex<NotificationCache>>, independent of Translator. The pump then holds only the cache lock (a fast HashMap::insert), never competing with request handlers that hold the translator lock across LSP round-trips.

// lib.rs — before
async fn notification_pump(
    ...,
    translator: Arc<Mutex<Translator>>,
) {
    while let Some(note) = rx.recv().await {
        let mut guard = translator.lock().await;  // contends with request handlers
        let cache = guard.notification_cache_mut();
        // ...
    }
}

// lib.rs — after
async fn notification_pump(
    ...,
    cache: Arc<Mutex<NotificationCache>>,
) {
    while let Some(note) = rx.recv().await {
        let mut guard = cache.lock().await;  // independent lock, never held across LSP I/O
        // ...
    }
}
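
To make the effect of the split concrete, here is a minimal sketch of the "after" shape. It is an illustration under stated assumptions rather than the proposed patch: it uses std threads and std::sync::Mutex in place of the async lock, and the struct bodies, the (uri, diagnostics) message type, and the helper cached_while_request_in_flight are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for NotificationCache: latest diagnostics per URI.
#[derive(Default)]
pub struct NotificationCache {
    diagnostics: HashMap<String, Vec<String>>,
}

/// Hypothetical stand-in for Translator; it no longer owns the cache.
pub struct Translator;

/// Drains notifications while holding only the cache lock, and only for the insert.
pub fn notification_pump(
    rx: Receiver<(String, Vec<String>)>,
    cache: Arc<Mutex<NotificationCache>>,
) {
    while let Ok((uri, diags)) = rx.recv() {
        let mut guard = cache.lock().unwrap();
        guard.diagnostics.insert(uri, diags);
        // guard is dropped here, before the next recv()
    }
}

/// Pushes `n` notifications while a request handler holds the translator lock
/// and returns how many were cached before that lock was released.
pub fn cached_while_request_in_flight(n: usize) -> usize {
    let translator = Arc::new(Mutex::new(Translator));
    let cache = Arc::new(Mutex::new(NotificationCache::default()));
    // Small bound (an assumption) so a starved pump would block the sender.
    let (tx, rx) = sync_channel(1);

    let pump = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || notification_pump(rx, cache))
    };

    // Simulates get_diagnostics holding the translator lock across a round-trip.
    let request_guard = translator.lock().unwrap();

    for i in 0..n {
        let note = (format!("file:///{}.rs", i), vec!["unused variable".to_string()]);
        tx.send(note).unwrap();
    }
    drop(tx); // closing the channel lets the pump exit its loop
    pump.join().unwrap();

    let cached = cache.lock().unwrap().diagnostics.len();
    drop(request_guard);
    cached
}
```

In this model cached_while_request_in_flight(8) returns 8: every notification is cached while the translator lock is still held, because the pump never asks for that lock.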

Blocked By

#102

Severity

High — reproducible under any LSP server that pushes diagnostics during a get_diagnostics call (rust-analyzer, tsgo, pyright). Manifests as a 30 s stall on every diagnostics request once the notification channel fills.

Labels

P1 (High: degraded UX, incorrect non-destructive behavior), bug (Something isn't working)
