
perf: increase heartbeat interval for LLM workloads#58

Merged
Killea merged 3 commits into Killea:main from bertheto:perf/heartbeat-interval-tuning
Mar 11, 2026

Conversation

@bertheto
Contributor

Summary

  • Increase HEARTBEAT_INTERVAL from 20s to 40s to reduce unnecessary heartbeat DB writes during msg_wait
  • Increase AGENT_HEARTBEAT_TIMEOUT from 30s to 60s to match (prevents agents from being marked offline between heartbeats)

Motivation

For LLM-based agents, response times regularly exceed 20s. The 20s heartbeat interval caused:

  • Unnecessary DB writes every poll tick during msg_wait (3 agents in parallel = 3 heartbeats/20s)
  • No functional benefit since agents were never actually offline — just slow to respond

Changes

  • src/tools/dispatch.py: HEARTBEAT_INTERVAL 20.0 → 40.0
  • src/config.py: AGENT_HEARTBEAT_TIMEOUT default 30 → 60

CRITICAL: Both values MUST be updated together. If only HEARTBEAT_INTERVAL is increased to 40s while AGENT_HEARTBEAT_TIMEOUT stays at 30s, agents would be incorrectly marked offline between heartbeats (40s > 30s = regression).
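The invariant described above can be sketched in a few lines. The constant names come from this PR; the staleness check itself is a hypothetical illustration, not the project's actual dispatch.py code:

```python
# Hypothetical sketch of the invariant this PR enforces. Constant names
# mirror the PR; the staleness check is an assumed illustration, not
# the project's actual code.

HEARTBEAT_INTERVAL = 40.0        # src/tools/dispatch.py (was 20.0)
AGENT_HEARTBEAT_TIMEOUT = 60.0   # src/config.py default (was 30)

# The rule the note above states: the timeout must exceed the interval,
# or a healthy agent is declared offline between two heartbeats.
assert AGENT_HEARTBEAT_TIMEOUT > HEARTBEAT_INTERVAL

def is_offline(last_heartbeat: float, now: float) -> bool:
    """True if the last heartbeat is older than the liveness timeout."""
    return (now - last_heartbeat) > AGENT_HEARTBEAT_TIMEOUT
```

With the old values reversed (40s interval, 30s timeout), `is_offline` would return True for any agent mid-interval, which is exactly the regression the note warns about.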

Performance impact

  • Reduces heartbeat DB writes by 50% during long-poll waits
  • Negligible impact on agent liveness detection (60s timeout is still well within reasonable bounds for LLM workloads)

Test plan

  • Manual test: run a multi-agent session and confirm agents stay online
  • Verify no agents are marked offline between heartbeats
  • Verify heartbeat writes are reduced (check DB or server logs)

Increase HEARTBEAT_INTERVAL from 20s to 40s and AGENT_HEARTBEAT_TIMEOUT
from 30s to 60s. For LLM-based agents, response times regularly exceed
20s, causing unnecessary heartbeat DB writes every poll tick. The 20s
interval was shorter than needed for typical AI agent interactions.

Both values are updated together to avoid marking agents offline
between heartbeats (40s > 30s old timeout would be a regression).

Made-with: Cursor

The test_timeout_handling tests were patching asyncio.wait_for globally,
which interfered with the event-driven msg_wait mechanism introduced in
dispatch.py (asyncio.wait_for(event.wait(), timeout=1.0)).

Scope the mock to src.main.asyncio.wait_for so it only intercepts calls
from main.py, leaving dispatch.py's event-based wait unaffected.

Note: test_api_threads_success was already failing on main before this PR
(TypeError: 'coroutine' object is not iterable — caused by the threads_agents_map
refactor). This fix addresses both the pre-existing failure and the new
interference introduced by the event-driven msg_wait.

Made-with: Cursor
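The "patch where it's looked up" idea behind this commit can be illustrated with a self-contained stand-in module. The stub below binds wait_for directly (an assumption about main.py's import style); patching the stub's attribute then affects only its calls, while direct asyncio.wait_for calls elsewhere stay real:

```python
# Minimal sketch of scoping a mock to one module's reference instead of
# patching asyncio.wait_for globally. "main_stub" is a stand-in for
# src/main.py; its import style here is an assumption.
import asyncio
import types
from unittest.mock import AsyncMock, patch

# Stand-in module that binds its own reference to wait_for.
main = types.ModuleType("main_stub")
exec("from asyncio import wait_for", main.__dict__)

async def demo():
    with patch.object(main, "wait_for", AsyncMock(return_value="mocked")):
        # Calls routed through the stub module hit the mock...
        via_main = await main.wait_for(None, timeout=1)
        # ...while direct asyncio.wait_for calls remain untouched,
        # which is what keeps dispatch.py's event-based wait working.
        via_asyncio = await asyncio.wait_for(
            asyncio.sleep(0, result="real"), timeout=1
        )
    return via_main, via_asyncio

print(asyncio.run(demo()))  # ('mocked', 'real')
```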

The previous approach patched asyncio.wait_for/gather globally (then
src.main-scoped), but api_threads nests wait_for inside gather, making
the mock fragile against code structure changes.

Mock the CRUD layer directly instead:
- patch get_db to return a mock db connection
- patch crud.thread_list, crud.thread_count, crud.threads_agents_map
  as AsyncMocks with controlled return values

This is the correct level of abstraction: tests verify endpoint logic,
not asyncio plumbing. Also fixes the pre-existing failure on main
introduced by the threads_agents_map refactor (missing mock for the new
third await call).

Made-with: Cursor
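The CRUD-level mocking strategy from the commit above can be sketched as follows. The function names (thread_list, thread_count, threads_agents_map) come from the commit message; the endpoint body and signatures are assumptions for illustration:

```python
# Hedged sketch of mocking the CRUD layer with AsyncMocks rather than
# patching asyncio plumbing. CRUD names come from the commit message;
# the endpoint shape is a hypothetical stand-in for api_threads.
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# Three awaited CRUD calls, each with a controlled return value. The
# third mock is the one whose absence broke the test on main.
crud = SimpleNamespace(
    thread_list=AsyncMock(return_value=[{"id": 1}]),
    thread_count=AsyncMock(return_value=1),
    threads_agents_map=AsyncMock(return_value={1: ["agent-a"]}),
)

async def api_threads(db):
    # Stand-in endpoint: the test verifies this logic, not asyncio.
    threads = await crud.thread_list(db)
    total = await crud.thread_count(db)
    agents = await crud.threads_agents_map(db)
    return {"threads": threads, "total": total, "agents": agents}

result = asyncio.run(api_threads(db=object()))
print(result["total"])  # 1
```

Because the mocks sit at the CRUD boundary, the test survives refactors to how the endpoint composes its awaits (nested wait_for inside gather, or otherwise), which is the fragility the commit message calls out.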
@Killea Killea merged commit ef93775 into Killea:main Mar 11, 2026
1 check passed
