
perf: increase heartbeat interval for LLM workloads#58

Merged
Killea merged 3 commits into Killea:main from bertheto:perf/heartbeat-interval-tuning
Mar 11, 2026

Conversation

@bertheto
Contributor

Summary

  • Increase HEARTBEAT_INTERVAL from 20s to 40s to reduce unnecessary heartbeat DB writes during msg_wait
  • Increase AGENT_HEARTBEAT_TIMEOUT from 30s to 60s to match (prevents agents from being marked offline between heartbeats)

Motivation

For LLM-based agents, response times regularly exceed 20s. The 20s heartbeat interval caused:

  • Unnecessary DB writes every poll tick during msg_wait (3 agents in parallel = 3 heartbeats/20s)
  • No functional benefit since agents were never actually offline — just slow to respond

Changes

  • src/tools/dispatch.py: HEARTBEAT_INTERVAL 20.0 → 40.0
  • src/config.py: AGENT_HEARTBEAT_TIMEOUT default 30 → 60

CRITICAL: Both values MUST be updated together. If only HEARTBEAT_INTERVAL is increased to 40s while AGENT_HEARTBEAT_TIMEOUT stays at 30s, agents would be incorrectly marked offline between heartbeats (40s > 30s = regression).
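The invariant described above can be sketched in a few lines. The constant names come from this PR; the staleness check itself is a hypothetical illustration, not the project's actual dispatch.py code:

```python
# Hypothetical sketch of the invariant this PR enforces. Constant names
# mirror the PR; the staleness check is an assumed illustration, not
# the project's actual code.

HEARTBEAT_INTERVAL = 40.0        # src/tools/dispatch.py (was 20.0)
AGENT_HEARTBEAT_TIMEOUT = 60.0   # src/config.py default (was 30)

# The rule the note above states: the timeout must exceed the interval,
# or a healthy agent is declared offline between two heartbeats.
assert AGENT_HEARTBEAT_TIMEOUT > HEARTBEAT_INTERVAL

def is_offline(last_heartbeat: float, now: float) -> bool:
    """True if the last heartbeat is older than the liveness timeout."""
    return (now - last_heartbeat) > AGENT_HEARTBEAT_TIMEOUT
```

With the old values reversed (40s interval, 30s timeout), `is_offline` would return True for any agent mid-interval, which is exactly the regression the note warns about.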

Performance impact

  • Reduces heartbeat DB writes by 50% during long-poll waits
  • Negligible impact on agent liveness detection (60s timeout is still well within reasonable bounds for LLM workloads)

Test plan

  • Manual test: run a multi-agent session and confirm agents stay online
  • Verify no agents are marked offline between heartbeats
  • Verify heartbeat writes are reduced (check DB or server logs)

Increase HEARTBEAT_INTERVAL from 20s to 40s and AGENT_HEARTBEAT_TIMEOUT
from 30s to 60s. For LLM-based agents, response times regularly exceed
20s, causing unnecessary heartbeat DB writes every poll tick. The 20s
interval was shorter than needed for typical AI agent interactions.

Both values are updated together to avoid marking agents offline
between heartbeats (40s > 30s old timeout would be a regression).

Made-with: Cursor

The test_timeout_handling tests were patching asyncio.wait_for globally,
which interfered with the event-driven msg_wait mechanism introduced in
dispatch.py (asyncio.wait_for(event.wait(), timeout=1.0)).

Scope the mock to src.main.asyncio.wait_for so it only intercepts calls
from main.py, leaving dispatch.py's event-based wait unaffected.

Note: test_api_threads_success was already failing on main before this PR
(TypeError: 'coroutine' object is not iterable — caused by the threads_agents_map
refactor). This fix addresses both the pre-existing failure and the new
interference introduced by the event-driven msg_wait.

Made-with: Cursor
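The "patch where it's looked up" idea behind this commit can be illustrated with a self-contained stand-in module. The stub below binds wait_for directly (an assumption about main.py's import style); patching the stub's attribute then affects only its calls, while direct asyncio.wait_for calls elsewhere stay real:

```python
# Minimal sketch of scoping a mock to one module's reference instead of
# patching asyncio.wait_for globally. "main_stub" is a stand-in for
# src/main.py; its import style here is an assumption.
import asyncio
import types
from unittest.mock import AsyncMock, patch

# Stand-in module that binds its own reference to wait_for.
main = types.ModuleType("main_stub")
exec("from asyncio import wait_for", main.__dict__)

async def demo():
    with patch.object(main, "wait_for", AsyncMock(return_value="mocked")):
        # Calls routed through the stub module hit the mock...
        via_main = await main.wait_for(None, timeout=1)
        # ...while direct asyncio.wait_for calls remain untouched,
        # which is what keeps dispatch.py's event-based wait working.
        via_asyncio = await asyncio.wait_for(
            asyncio.sleep(0, result="real"), timeout=1
        )
    return via_main, via_asyncio

print(asyncio.run(demo()))  # ('mocked', 'real')
```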

The previous approach patched asyncio.wait_for/gather globally (then
src.main-scoped), but api_threads nests wait_for inside gather, making
the mock fragile against code structure changes.

Mock the CRUD layer directly instead:
- patch get_db to return a mock db connection
- patch crud.thread_list, crud.thread_count, crud.threads_agents_map
  as AsyncMocks with controlled return values

This is the correct level of abstraction: tests verify endpoint logic,
not asyncio plumbing. Also fixes the pre-existing failure on main
introduced by the threads_agents_map refactor (missing mock for the new
third await call).

Made-with: Cursor
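The CRUD-level mocking strategy from the commit above can be sketched as follows. The function names (thread_list, thread_count, threads_agents_map) come from the commit message; the endpoint body and signatures are assumptions for illustration:

```python
# Hedged sketch of mocking the CRUD layer with AsyncMocks rather than
# patching asyncio plumbing. CRUD names come from the commit message;
# the endpoint shape is a hypothetical stand-in for api_threads.
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# Three awaited CRUD calls, each with a controlled return value. The
# third mock is the one whose absence broke the test on main.
crud = SimpleNamespace(
    thread_list=AsyncMock(return_value=[{"id": 1}]),
    thread_count=AsyncMock(return_value=1),
    threads_agents_map=AsyncMock(return_value={1: ["agent-a"]}),
)

async def api_threads(db):
    # Stand-in endpoint: the test verifies this logic, not asyncio.
    threads = await crud.thread_list(db)
    total = await crud.thread_count(db)
    agents = await crud.threads_agents_map(db)
    return {"threads": threads, "total": total, "agents": agents}

result = asyncio.run(api_threads(db=object()))
print(result["total"])  # 1
```

Because the mocks sit at the CRUD boundary, the test survives refactors to how the endpoint composes its awaits (nested wait_for inside gather, or otherwise), which is the fragility the commit message calls out.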
@Killea Killea merged commit ef93775 into Killea:main Mar 11, 2026
1 check passed
