Fix/subagent resilience#45
Closed
jtung1027 wants to merge 8 commits into
Cursor subagents are short-lived processes that disconnect mid-stream as
normal lifecycle behaviour. The rapid churn of broken-pipe events was
destabilising the proxy, causing a cascade of crashes and failed restarts.
Changes:
- server.py: add SO_REUSEPORT (with graceful fallback) so a freshly-started
proxy can bind even while a dying predecessor holds the socket in CLOSE_WAIT
- server.py: downgrade client-disconnect log messages from WARNING to INFO;
broken pipes are expected noise with subagents, not errors
- server.py: abort upstream socket drain immediately on client disconnect
instead of continuing to read DeepSeek tokens that will never be delivered;
also skip the final reasoning-cache write when there is nothing to cache
- reasoning_store.py: add PRAGMA busy_timeout=5000 so SQLite waits up to
5 s for a concurrent write lock instead of immediately raising
OperationalError("database is locked")
- reasoning_store.py: enable WAL journal mode for better concurrent read/write
throughput under parallel subagent requests
- reasoning_store.py: retry startup prune up to 3 times with 0.5 s wait;
degrade gracefully (log warning, skip prune) rather than crashing if the
DB is still locked after retries
- tests/test_server.py: update two tests that asserted WARNING-level logs for
client-disconnect events to assert INFO-level instead
Co-authored-by: Cursor <cursoragent@cursor.com>
…t it)
Co-authored-by: Cursor <cursoragent@cursor.com>
Author
Actually, I'm still experiencing issues. I may have just masked them with waits and not solved the underlying problem. Let me retry.
…load
The core issue causing multi-second freezes was that every streaming write called _prune_locked() (a full-table O(N log N) sort) while holding the only lock in the store, blocking all concurrent reads.
Changes:
- Thread-local read connections: each handler thread gets its own SQLite connection; WAL mode handles file-level concurrency so reads never wait on writes and vice versa.
- created_at index: turns age and row-count prune queries from O(N log N) to O(log N) / O(max_rows).
- Throttled row-count prune: skip the expensive NOT-IN subquery when the DB is well under capacity; always prune when over the limit.
- SQLite pragma tuning: 32 MiB write cache, 8 MiB per read connection, synchronous=NORMAL (safe under WAL).
- _proxy_already_running() health check in main(): new startup attempts exit cleanly if the proxy is already serving, eliminating the "Address already in use" cascade.
- autostart.bash rewritten as a status-check-only script: the proxy is started once manually; agents never trigger auto-start logic.
- CONCURRENT_DESIGN.md: full architecture + root-cause + fix narrative.
All 96 tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
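The thread-local read connections and the created_at index from this commit could look roughly like the sketch below. It is a minimal illustration, not the code in reasoning_store.py: the table name `reasoning`, its columns, the index name, and the `ThreadLocalReader` class are all hypothetical.

```python
import sqlite3
import threading


def init_schema(path: str) -> None:
    """Create a hypothetical reasoning table plus a created_at index."""
    conn = sqlite3.connect(path)
    try:
        # WAL lets readers and the single writer work on the file concurrently.
        conn.execute("PRAGMA journal_mode = WAL")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reasoning ("
            "key TEXT PRIMARY KEY, body TEXT, created_at REAL)"
        )
        # Index so age-based prune queries do not need a full-table sort.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_reasoning_created_at "
            "ON reasoning (created_at)"
        )
        conn.commit()
    finally:
        conn.close()


class ThreadLocalReader:
    """Hands each calling thread its own read-only SQLite connection."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._local = threading.local()

    def connection(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._path)
            # Reject accidental writes on what is meant to be a read connection.
            conn.execute("PRAGMA query_only = ON")
            self._local.conn = conn
        return conn
```

Because every handler thread reuses its own connection, reads do not queue behind a shared Python-level lock; write serialization is left to SQLite itself.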
- Write connection cache_size: 8 MiB → 512 MiB (holds entire 300 MB DB)
- Read connection cache_size: 8 MiB → 64 MiB per thread
- mmap_size = 2 GiB: maps the DB file into virtual address space so all connections share OS page-cache pages; reads become zero-copy and the 300 MB DB stays warm in RAM indefinitely on a 128 GB machine
Co-authored-by: Cursor <cursoragent@cursor.com>
Row limit: 100k → 5M (theoretical max ~27 GB at current row size)
Max age: 30d → 1yr (conversations from the last year always restore)
mmap_size: 2 GiB → 32 GiB ceiling (covers full 5M-row DB in virtual space)
The row-count prune fires only when rows actually exceed 5M; at current usage (55,925 rows) it will not trigger for years.
Co-authored-by: Cursor <cursoragent@cursor.com>
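The "~27 GB" figure can be checked with a quick back-of-the-envelope calculation. The sketch assumes that the 300 MB database size quoted in the previous commit and the 55,925-row count quoted here describe the same database, and that average row size stays constant as the table grows.

```python
# Figures quoted in the commit messages; pairing them is an assumption.
DB_BYTES = 300 * 10**6        # "300 MB DB"
CURRENT_ROWS = 55_925         # "current usage (55,925 rows)"
ROW_LIMIT = 5_000_000         # new row limit
MMAP_CEILING = 32 * 2**30     # 32 GiB mmap_size ceiling

bytes_per_row = DB_BYTES / CURRENT_ROWS          # roughly 5.4 kB per row
projected_bytes = bytes_per_row * ROW_LIMIT      # size at the row limit
projected_gb = projected_bytes / 10**9           # about 26.8 GB

print(f"{bytes_per_row:.0f} bytes/row, {projected_gb:.1f} GB at {ROW_LIMIT:,} rows")
print("fits under mmap ceiling:", projected_bytes < MMAP_CEILING)
```

Under those assumptions the projected size is consistent with the stated ~27 GB and stays below the 32 GiB mmap ceiling.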
Co-authored-by: Cursor <cursoragent@cursor.com>
- All log lines now include a timestamp regardless of verbose mode.
  Before: INFO lines had no timestamp (message only).
  After: 2026-05-24 17:31:04,821 INFO ┌ cursor model=...
- Verbose mode adds [ThreadName] for concurrent request debugging.
- _trim_log_file() trims autostart.log to 5,000 most-recent lines at each proxy startup, keeping the same filename so open file handles in Cursor stay valid and always show the latest content.
- Update tests to match new format (timestamps always present).
Co-authored-by: Cursor <cursoragent@cursor.com>
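One way to trim a log in place, so the path and underlying file stay the same for anything that already has it open, is sketched below. This is an illustration of the idea behind _trim_log_file(), not the PR's implementation; the function name `trim_log_file` and its return value are assumptions.

```python
from collections import deque


def trim_log_file(path: str, max_lines: int = 5000) -> int:
    """Keep only the last max_lines lines of path, rewriting the file in place.

    Returns the number of lines dropped. The file is truncated and rewritten
    rather than replaced, so the filename and the file itself are unchanged.
    """
    try:
        with open(path, "r+b") as fh:
            total = 0
            tail = deque(maxlen=max_lines)  # holds only the newest lines
            for line in fh:
                total += 1
                tail.append(line)
            if total <= max_lines:
                return 0  # nothing to trim
            fh.seek(0)
            fh.truncate(0)
            fh.writelines(tail)
            return total - max_lines
    except FileNotFoundError:
        return 0  # no log yet, e.g. on first startup
```

Rewriting in place is not atomic, so a crash mid-trim could lose log lines; for a diagnostic log trimmed once at startup that trade-off seems acceptable.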
Author
Actually, I caused some of these issues myself by adding proxy starts to new terminal startups, so I don't think this is an issue for others, just for me.
Summary
Improve resilience against crash cascades triggered by short-lived Cursor subagents and fix a concurrent-startup database lock issue.
Changes
When a Cursor subagent disconnects mid-stream (normal — they're short-lived), the proxy now:
- Stops draining the upstream socket immediately instead of reading DeepSeek tokens that nobody will receive.
- Skips the reasoning store write when the client disconnected early and the accumulator has no reasoning content to cache, avoiding unnecessary DB contention on subagent churn.
- Logs client disconnects at INFO instead of WARNING, since this is expected behavior, not something that needs operator attention.
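The disconnect handling described above can be sketched as follows. This is a simplified illustration and not the code in server.py: `relay_stream`, `should_cache_reasoning`, and the logger name are hypothetical, and the upstream response is modelled as a plain iterator of byte chunks.

```python
import logging
from typing import Callable, Iterator, Tuple

log = logging.getLogger("proxy.sketch")  # hypothetical logger name

# Exceptions that indicate the client side of the connection went away.
CLIENT_GONE = (BrokenPipeError, ConnectionResetError, ConnectionAbortedError)


def relay_stream(
    upstream: Iterator[bytes], write: Callable[[bytes], object]
) -> Tuple[int, bool]:
    """Forward chunks to the client; stop reading upstream once the client is gone.

    Returns (chunks_delivered, client_disconnected).
    """
    delivered = 0
    try:
        for chunk in upstream:
            try:
                write(chunk)
            except CLIENT_GONE:
                # Expected with short-lived subagents, so INFO rather than WARNING.
                log.info("client disconnected after %d chunks", delivered)
                return delivered, True
            delivered += 1
        return delivered, False
    finally:
        # Release the upstream source if it supports close() (generators do).
        close = getattr(upstream, "close", None)
        if close is not None:
            close()


def should_cache_reasoning(client_disconnected: bool, reasoning: str) -> bool:
    """Skip the store write only when the client left early and nothing accumulated."""
    return not (client_disconnected and not reasoning)
```

Returning the disconnect flag lets the caller decide about the cache write without inspecting exceptions a second time.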
2. ReasoningStore startup race fix (reasoning_store.py)
Multiple proxy instances starting nearly simultaneously (e.g. several terminals opened in quick succession, or after a crash+restart) could collide on the SQLite database:
- Added PRAGMA busy_timeout = 5000: SQLite now waits up to 5 s for a concurrent writer instead of immediately raising OperationalError("database is locked").
- Switched to WAL journal mode, which lets one writer and multiple readers proceed concurrently without blocking each other.
- Added a retry loop with graceful degradation: if the lock persists beyond busy_timeout, the startup prune is skipped rather than crashing the whole proxy. Stale rows are harmless and will be cleaned up on the next successful prune.
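A minimal sketch of these three pieces, assuming a store opened with Python's built-in sqlite3 module. The function names `open_store` and `prune_startup_with_retry` and the logger name are hypothetical; the 3 attempts and 0.5 s wait come from the commit message.

```python
import logging
import sqlite3
import time
from typing import Callable

log = logging.getLogger("reasoning_store.sketch")  # hypothetical logger name


def open_store(path: str) -> sqlite3.Connection:
    """Open the DB with a 5 s busy timeout and WAL journaling."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA busy_timeout = 5000")  # wait up to 5 s on a held lock
    conn.execute("PRAGMA journal_mode = WAL")   # one writer, concurrent readers
    return conn


def prune_startup_with_retry(
    conn: sqlite3.Connection,
    prune: Callable[[sqlite3.Connection], None],
    attempts: int = 3,
    wait_s: float = 0.5,
) -> bool:
    """Run prune(conn); retry on lock errors and give up quietly after attempts.

    Returns True if the prune ran, False if it was skipped.
    """
    for attempt in range(1, attempts + 1):
        try:
            prune(conn)
            return True
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc).lower():
                raise  # unrelated failures should still surface
            if attempt < attempts:
                time.sleep(wait_s)
    log.warning("database still locked after %d attempts; skipping startup prune", attempts)
    return False
```

Note that `sqlite3.connect()` also accepts a `timeout` argument (5.0 seconds by default) that serves a similar purpose; setting the pragma makes the intended wait explicit.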
3. Reverted SO_REUSEPORT (commit 9195fa9)
Removed the SO_REUSEPORT socket option. The approach was incorrect because SO_REUSEPORT only takes effect when every socket bound to the port sets it, not just the newly started one.
I have tested this change, and no errors are logged anymore when subagents are spawned.
I asked my agent to use subagents to find the latest stock price for the Mag7 stocks, and all 7 were spawned and returned successfully, albeit a bit more slowly than I think they should. But no errors were logged, unlike before.