perf: increase heartbeat interval for LLM workloads #58
Merged
Killea merged 3 commits into Killea:main on Mar 11, 2026
Increase HEARTBEAT_INTERVAL from 20s to 40s and AGENT_HEARTBEAT_TIMEOUT from 30s to 60s.

For LLM-based agents, response times regularly exceed 20s, so the 20s interval was shorter than typical AI agent interactions need and caused unnecessary heartbeat DB writes on every poll tick.

Both values are updated together to avoid marking agents offline between heartbeats: a 40s interval against the old 30s timeout would be a regression.

Made-with: Cursor
The test_timeout_handling tests were patching asyncio.wait_for globally, which interfered with the event-driven msg_wait mechanism introduced in dispatch.py (asyncio.wait_for(event.wait(), timeout=1.0)).

Scope the mock to src.main.asyncio.wait_for so it only intercepts calls from main.py, leaving dispatch.py's event-based wait unaffected.

Note: test_api_threads_success was already failing on main before this PR (TypeError: 'coroutine' object is not iterable, caused by the threads_agents_map refactor). This fix addresses both the pre-existing failure and the new interference introduced by the event-driven msg_wait.

Made-with: Cursor
The previous approach patched asyncio.wait_for/gather globally (then src.main-scoped), but api_threads nests wait_for inside gather, making the mock fragile against code structure changes.

Mock the CRUD layer directly instead:
- patch get_db to return a mock db connection
- patch crud.thread_list, crud.thread_count, crud.threads_agents_map as AsyncMocks with controlled return values

This is the correct level of abstraction: tests verify endpoint logic, not asyncio plumbing.

Also fixes the pre-existing failure on main introduced by the threads_agents_map refactor (missing mock for the new third await call).

Made-with: Cursor
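A minimal sketch of the CRUD-level mocking described in the last commit. The real `src.main`, `get_db`, and `crud` modules are not shown in this PR, so the `deps` and `crud` namespaces, the `api_threads` body, and every function signature below are hypothetical stand-ins. Only the mocked names (`get_db`, `thread_list`, `thread_count`, `threads_agents_map`) and the wait_for-inside-gather shape come from the commit message.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock, patch


async def _real_get_db():
    # Hypothetical: the real get_db would open a database connection.
    raise RuntimeError("real database must not be reached in unit tests")


# Hypothetical stand-ins for the project's dependency and CRUD modules.
deps = SimpleNamespace(get_db=_real_get_db)
crud = SimpleNamespace(thread_list=None, thread_count=None, threads_agents_map=None)


async def api_threads():
    """Assumed endpoint shape: wait_for nested inside gather, then a third await."""
    db = await deps.get_db()
    threads, total = await asyncio.gather(
        asyncio.wait_for(crud.thread_list(db), timeout=5),
        asyncio.wait_for(crud.thread_count(db), timeout=5),
    )
    agents = await crud.threads_agents_map(db, [t["id"] for t in threads])
    return {
        "total": total,
        "threads": [{**t, "agents": agents.get(t["id"], [])} for t in threads],
    }


def run_mocked_api_threads():
    """Patch the data layer only; asyncio.wait_for and gather stay real."""
    with patch.object(deps, "get_db", AsyncMock(return_value=object())), \
         patch.object(crud, "thread_list", AsyncMock(return_value=[{"id": "t1"}])), \
         patch.object(crud, "thread_count", AsyncMock(return_value=1)), \
         patch.object(crud, "threads_agents_map",
                      AsyncMock(return_value={"t1": ["agent-a"]})) as agents_map:
        result = asyncio.run(api_threads())
        # The third await is the one the earlier mocks were missing.
        agents_map.assert_awaited_once()
    return result
```

Because nothing in `asyncio` is patched, restructuring how the endpoint combines `gather` and `wait_for` would not break a test written this way, as long as the same CRUD calls are made.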
Summary
Motivation
For LLM-based agents, response times regularly exceed 20s. The 20s heartbeat interval caused unnecessary heartbeat DB writes on every poll tick.
Changes
CRITICAL: Both values MUST be updated together. If only HEARTBEAT_INTERVAL is increased to 40s while AGENT_HEARTBEAT_TIMEOUT stays at 30s, agents would be incorrectly marked offline between heartbeats, because the 40s interval exceeds the 30s timeout.
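The constraint can be sketched as a small check. The two constant values come from this PR, but the helper names `is_config_safe` and `is_online` are hypothetical, and the project's actual offline-detection logic may differ.

```python
from datetime import datetime, timedelta, timezone

# Values from this PR, in seconds.
HEARTBEAT_INTERVAL = 40        # time between heartbeat writes
AGENT_HEARTBEAT_TIMEOUT = 60   # silence allowed before an agent is marked offline


def is_config_safe(interval: float, timeout: float) -> bool:
    """Hypothetical guard: the timeout must exceed the interval."""
    return timeout > interval


def is_online(last_heartbeat: datetime, now: datetime,
              timeout: float = AGENT_HEARTBEAT_TIMEOUT) -> bool:
    """Hypothetical check: online while the last heartbeat is within the timeout."""
    return (now - last_heartbeat) <= timedelta(seconds=timeout)


now = datetime.now(timezone.utc)
last_beat = now - timedelta(seconds=39)  # healthy agent just before its next 40s beat

new_config_ok = is_online(last_beat, now)              # 60s timeout
old_timeout_ok = is_online(last_beat, now, timeout=30)  # 30s timeout kept by mistake
```

With the old 30s timeout, a healthy agent 39s past its last heartbeat would already be reported offline, which is the regression the note warns about.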
Performance impact
Test plan