Skip to content

DB latency amplifies significantly across MCP tool calls #2995

@tlgimenes

Description

@tlgimenes

Summary

A 5s per-TCP-segment latency on the Postgres connection results in 45-60s total time for a single MCP tool call (COLLECTION_CONNECTIONS_LIST). This is because each tool call involves multiple sequential DB roundtrips: authentication, permission checks, and the actual tool query.

Reproduction

Using the resilience test bed (tests/resilience/):

  1. Start the Docker stack
  2. Add 5s latency toxic to Postgres: curl -X POST http://127.0.0.1:18474/proxies/postgres/toxics -d '{"type":"latency","attributes":{"latency":5000},"name":"db-slow"}'
  3. Call any MCP tool and measure: a simple COLLECTION_CONNECTIONS_LIST takes ~45s
  4. With 15s latency, health check (SELECT 1) takes 15s but tool calls would take 2-3 minutes

Observed Behavior

DB Latency Health Check (SELECT 1) Tool Call (COLLECTION_CONNECTIONS_LIST)
0ms ~5ms ~30ms
5s/segment ~5s ~45-60s
15s/segment ~15s 2-3min (estimated)

Analysis

Each MCP tool call goes through this pipeline, each step hitting the DB:

  1. API key verification — look up key in apikeys table
  2. Session/user resolution — query user/session tables
  3. Organization resolution — query organization membership
  4. Permission check — query API key permissions
  5. Tool execution — the actual query (e.g., list connections)
  6. Audit logging — write audit log entry

With 5s latency per segment, 9+ DB roundtrips = 45s+ total.

Impact

  • User experience: During DB slowdowns (e.g., vacuum, replication lag), users experience tool call timeouts even though the DB is technically "up"
  • Connection pool exhaustion: Slow queries hold connections longer, reducing pool capacity for other requests
  • Cascading failures: Health checks pass (single SELECT 1 is fast enough) while actual tool calls timeout — load balancers continue routing to degraded pods

Potential Mitigations

  1. Connection pooling with statement timeout: Set statement_timeout at the pool level so individual queries fail fast
  2. Caching: Cache auth/permission lookups (they rarely change) to reduce DB roundtrips per tool call
  3. Circuit breaker on DB: If average query latency exceeds threshold, start failing fast instead of queuing
  4. Health check with representative query: Use a query that approximates real tool call cost, not just SELECT 1

Found By

Resilience test bed: tests/resilience/scenarios/postgres-slow.test.ts

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions