Fix/subagent resilience by jtung1027 · Pull Request #45 · yxlao/deepseek-cursor-proxy

jtung1027 · 2026-05-24T10:10:34Z

Summary
Improve resilience against crash cascades triggered by short-lived Cursor subagents and fix a concurrent-startup database lock issue.

Changes

1. Subagent crash cascade fix (server.py)
When a Cursor subagent disconnects mid-stream (normal, since they're short-lived), the proxy now:

- Stops draining the upstream socket immediately instead of reading DeepSeek tokens that nobody will receive.
- Skips the reasoning store write when the client disconnected early and the accumulator has no reasoning content to cache, avoiding unnecessary DB contention on subagent churn.
- Logs client disconnects at INFO instead of WARNING, since this is expected behavior and does not need operator attention.
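
For illustration, here is a minimal sketch of the disconnect handling described above. It is not the code in server.py: `relay_stream`, `should_write_reasoning`, and their parameters are hypothetical names, and the upstream response is modeled as a plain iterable of byte chunks.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("proxy_sketch")  # hypothetical logger name


def relay_stream(upstream: Iterable[bytes], send_to_client: Callable[[bytes], None]) -> bool:
    """Forward chunks to the client and stop reading upstream once the client is gone.

    Returns True if the stream completed, False if the client disconnected early.
    """
    for chunk in upstream:
        try:
            send_to_client(chunk)
        except (BrokenPipeError, ConnectionResetError):
            # Expected with short-lived subagents, so log at INFO rather than WARNING.
            logger.info("client disconnected mid-stream; abandoning upstream read")
            return False
    return True


def should_write_reasoning(completed: bool, reasoning_parts: List[str]) -> bool:
    """Skip the store write only when the client left early and nothing was accumulated."""
    return completed or any(reasoning_parts)
```

Returning from the loop on the first failed write means no further chunks are pulled from upstream, which is the "stop draining" behavior in the first bullet.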
2. ReasoningStore startup race fix (reasoning_store.py)
Multiple proxy instances starting nearly simultaneously (e.g. several terminals opened in quick succession, or after a crash+restart) could collide on the SQLite database:

- Added PRAGMA busy_timeout = 5000: SQLite now waits up to 5s for a concurrent writer instead of immediately raising OperationalError("database is locked").
- Switched to WAL journal mode: allows one writer and multiple concurrent readers to proceed without blocking each other.
- Added retry loop with graceful degradation: if the lock persists beyond busy_timeout, the startup prune is skipped rather than crashing the whole proxy. Stale rows are harmless and will be cleaned up on the next successful prune.
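
The three store changes above could look roughly like the sketch below. The function names `open_store` and `prune_with_retry` are hypothetical, and the retry count and wait (3 attempts, 0.5 s) are taken from the commit message rather than from the actual reasoning_store.py code.

```python
import logging
import sqlite3
import time
from typing import Callable

logger = logging.getLogger("reasoning_store_sketch")  # hypothetical logger name


def open_store(path: str) -> sqlite3.Connection:
    """Open the database with a busy timeout and WAL journaling."""
    conn = sqlite3.connect(path)
    # Wait up to 5 s for a competing writer instead of failing immediately.
    conn.execute("PRAGMA busy_timeout = 5000")
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode = WAL")
    return conn


def prune_with_retry(prune: Callable[[], None], attempts: int = 3, wait_seconds: float = 0.5) -> bool:
    """Run the startup prune, retrying on lock errors; return False if it was skipped."""
    for attempt in range(1, attempts + 1):
        try:
            prune()
            return True
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc).lower():
                raise  # not a lock problem, so do not hide it
            logger.warning("startup prune attempt %d/%d hit a locked database", attempt, attempts)
            if attempt < attempts:
                time.sleep(wait_seconds)
    # Degrade gracefully: stale rows stay until the next successful prune.
    return False
```

A caller would ignore a False result and continue starting the proxy, since skipping the prune only delays cleanup.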
3. Reverted SO_REUSEPORT (commit 9195fa9)
Removed the SO_REUSEPORT socket option — the approach was incorrect since it requires both sockets to set the option, not just one.

I have tested this change and now no more errors are logged when I have subagents spawned.

I asked my agent to use subagents to find the latest stock price for the Mag7 stocks, and all 7 were spawned and returned successfully, albeit a bit slower than I think it should take. But no errors logged, unlike before.

Cursor subagents are short-lived processes that disconnect mid-stream as normal lifecycle behaviour. The rapid churn of broken-pipe events was destabilising the proxy, causing a cascade of crashes and failed restarts.

Changes:
- server.py: add SO_REUSEPORT (with graceful fallback) so a freshly-started proxy can bind even while a dying predecessor holds the socket in CLOSE_WAIT
- server.py: downgrade client-disconnect log messages from WARNING to INFO; broken pipes are expected noise with subagents, not errors
- server.py: abort upstream socket drain immediately on client disconnect instead of continuing to read DeepSeek tokens that will never be delivered; also skip the final reasoning-cache write when there is nothing to cache
- reasoning_store.py: add PRAGMA busy_timeout=5000 so SQLite waits up to 5 s for a concurrent write lock instead of immediately raising OperationalError("database is locked")
- reasoning_store.py: enable WAL journal mode for better concurrent read/write throughput under parallel subagent requests
- reasoning_store.py: retry startup prune up to 3 times with 0.5 s wait; degrade gracefully (log warning, skip prune) rather than crashing if the DB is still locked after retries
- tests/test_server.py: update two tests that asserted WARNING-level logs for client-disconnect events to assert INFO-level instead

Co-authored-by: Cursor <cursoragent@cursor.com>


jtung1027 · 2026-05-24T20:30:56Z

Actually, I'm still experiencing issues. I may have just masked them with waits rather than solving the underlying problem. Let me retry.

…load

The core issue causing multi-second freezes was that every streaming write called _prune_locked(), a full-table O(N log N) sort, while holding the only lock in the store, blocking all concurrent reads.

Changes:
- Thread-local read connections: each handler thread gets its own SQLite connection; WAL mode handles file-level concurrency so reads never wait on writes and vice versa.
- created_at index: turns age and row-count prune queries from O(N log N) to O(log N) / O(max_rows).
- Throttled row-count prune: skip the expensive NOT-IN subquery when the DB is well under capacity; always prune when over the limit.
- SQLite pragma tuning: 32 MiB write cache, 8 MiB per read connection, synchronous=NORMAL (safe under WAL).
- _proxy_already_running() health check in main(): new startup attempts exit cleanly if the proxy is already serving, eliminating the "Address already in use" cascade.
- autostart.bash rewritten as a status-check-only script: the proxy is started once manually; agents never trigger auto-start logic.
- CONCURRENT_DESIGN.md: full architecture + root-cause + fix narrative.

All 96 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
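
As a rough sketch of the first three items in that commit (thread-local read connections, the created_at index, and a row-count prune that only runs when over the limit), assuming a hypothetical table `reasoning` with columns `key` and `content` alongside `created_at`. The class and function names are illustrative, not the ones in reasoning_store.py, and the throttle here is simplified to a plain row-count check.

```python
import sqlite3
import threading
import time
from typing import Optional


class ReadConnections:
    """Give each handler thread its own read connection (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._local = threading.local()

    def get(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._path)
            conn.execute("PRAGMA busy_timeout = 5000")
            self._local.conn = conn
        return conn


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reasoning ("
        "key TEXT PRIMARY KEY, content TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    # The index lets age-based deletes and newest-first scans avoid a full sort.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_reasoning_created_at ON reasoning (created_at)")
    conn.commit()


def prune(conn: sqlite3.Connection, max_age_seconds: float, max_rows: int,
          now: Optional[float] = None) -> None:
    now = time.time() if now is None else now
    conn.execute("DELETE FROM reasoning WHERE created_at < ?", (now - max_age_seconds,))
    (count,) = conn.execute("SELECT COUNT(*) FROM reasoning").fetchone()
    if count > max_rows:
        # Only pay for the NOT IN subquery when the table is actually over the limit.
        conn.execute(
            "DELETE FROM reasoning WHERE key NOT IN ("
            "SELECT key FROM reasoning ORDER BY created_at DESC LIMIT ?)",
            (max_rows,),
        )
    conn.commit()
```

Writes would still go through a single locked write connection; only reads use `ReadConnections.get()`.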

- Write connection cache_size: 8 MiB → 512 MiB (holds entire 300 MB DB)
- Read connection cache_size: 8 MiB → 64 MiB per thread
- mmap_size = 2 GiB: maps the DB file into virtual address space so all connections share OS page-cache pages; reads become zero-copy and the 300 MB DB stays warm in RAM indefinitely on a 128 GB machine

Co-authored-by: Cursor <cursoragent@cursor.com>
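
A sketch of how such pragmas might be applied to a connection, assuming a hypothetical helper `tune_connection`. A negative cache_size is interpreted by SQLite as KiB, and SQLite builds can cap mmap_size at a compile-time maximum, so the helper reads the values back instead of assuming the request took effect.

```python
import sqlite3
from typing import Dict, Optional

MIB = 1024 * 1024
GIB = 1024 * MIB


def tune_connection(conn: sqlite3.Connection, cache_bytes: int, mmap_bytes: int) -> Dict[str, Optional[int]]:
    """Apply cache, mmap, and sync pragmas; return the values SQLite reports back."""
    # Negative cache_size means "this many KiB" rather than a page count.
    conn.execute(f"PRAGMA cache_size = {-(cache_bytes // 1024)}")
    conn.execute(f"PRAGMA mmap_size = {mmap_bytes}")
    conn.execute("PRAGMA synchronous = NORMAL")

    def read(pragma: str) -> Optional[int]:
        row = conn.execute(f"PRAGMA {pragma}").fetchone()
        return row[0] if row else None  # some builds return no row for mmap_size

    return {name: read(name) for name in ("cache_size", "mmap_size", "synchronous")}
```

For example, `tune_connection(conn, cache_bytes=512 * MIB, mmap_bytes=2 * GIB)` corresponds to the write-connection settings listed in this commit; comparing the returned `mmap_size` with the requested value shows whether the build capped it.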

- Row limit: 100k → 5M (theoretical max ~27 GB at current row size)
- Max age: 30d → 1yr (conversations from the last year always restore)
- mmap_size: 2 GiB → 32 GiB ceiling (covers full 5M-row DB in virtual space)

The row-count prune fires only when rows actually exceed 5M; at current usage (55,925 rows) it will not trigger for years.

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

- All log lines now include a timestamp regardless of verbose mode.
  Before: INFO lines had no timestamp (message only).
  After: 2026-05-24 17:31:04,821 INFO ┌ cursor model=...
- Verbose mode adds [ThreadName] for concurrent request debugging.
- _trim_log_file() trims autostart.log to 5,000 most-recent lines at each proxy startup, keeping the same filename so open file handles in Cursor stay valid and always show the latest content.
- Update tests to match new format (timestamps always present).

Co-authored-by: Cursor <cursoragent@cursor.com>
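
A minimal sketch of the logging changes in this commit, under stated assumptions: `build_formatter` and `trim_log_file` are hypothetical names (the PR's function is `_trim_log_file()`, whose body is not shown here), and the position of the thread name within the line is a guess.

```python
import logging
from collections import deque


def build_formatter(verbose: bool) -> logging.Formatter:
    """Always include a timestamp; add the thread name only in verbose mode."""
    if verbose:
        return logging.Formatter("%(asctime)s %(levelname)s [%(threadName)s] %(message)s")
    return logging.Formatter("%(asctime)s %(levelname)s %(message)s")


def trim_log_file(path: str, max_lines: int = 5000) -> None:
    """Keep only the last max_lines lines, rewriting the same file in place."""
    try:
        with open(path, "rb+") as fh:
            tail = deque(fh, maxlen=max_lines)  # retains only the newest lines
            fh.seek(0)
            fh.writelines(tail)
            fh.truncate()  # drop whatever remains past the rewritten tail
    except FileNotFoundError:
        pass  # nothing to trim on first startup
```

Rewriting in place, rather than writing a new file and renaming it over the old one, keeps the same file on disk, which matches the stated goal of not invalidating handles that an editor already has open.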

jtung1027 · 2026-05-24T21:33:51Z

Actually, I caused some of these issues myself by adding proxy starts to new terminal startups, so I don't think this is an issue for others, just for me.

jtung1027 and others added 3 commits May 24, 2026 05:45

fix: remove SO_REUSEPORT (wrong approach, requires both sockets to se…

9195fa9

…t it) Co-authored-by: Cursor <cursoragent@cursor.com>

fix: apply black formatting

9ff0815

jtung1027 and others added 5 commits May 24, 2026 17:01

style: apply black formatting

57f14e8

Co-authored-by: Cursor <cursoragent@cursor.com>

jtung1027 closed this May 24, 2026

jtung1027 wants to merge 8 commits into yxlao:main from jtung1027:fix/subagent-resilience
