
Add robustness improvements: retries, diagnostics, better error reporting #38

Open
idStar-bot wants to merge 1 commit into ZeframLou:main from idStar-bot:feat/robustness-improvements


@idStar-bot


Summary

Hardening from ~5 months of daily call-me use. Calls that previously failed opaquely (network hiccups, slow WebSocket bring-up, provider errors) now retry automatically and explain themselves when they can't recover.

  • Configurable connection timeout — default raised 15s → 30s; tune via CALLME_CONNECTION_TIMEOUT_MS. Slow cellular legs routinely needed more than 15s to establish the media stream.
  • Automatic retry — failed call attempts retry (default 2, via CALLME_MAX_RETRIES) with a short delay between attempts.
  • /diagnostics HTTP endpoint — server health, config snapshot, active-call detail, and recent lifecycle events for humans/scripts.
  • get_diagnostics MCP tool — lets the agent inspect the same diagnostics mid-conversation instead of guessing why a call failed.
  • Call lifecycle event tracking — rolling buffer of initiated/connected/failed/ended events surfaced through both diagnostics paths.
  • Better Telnyx error parsing — provider error payloads decoded into specific messages instead of generic failures.
  • Failure diagnostics with remediation hints — when all attempts fail, the error summarizes each attempt and suggests likely causes.
  • waitForConnection status logging — periodic progress lines (WebSocket/stream state) and precise timeout reasons.
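
For reviewers who want a concrete picture of the timeout and retry settings, here is a minimal sketch rather than the code in this PR. Only the environment variable names, the 30s and 2-retry defaults, and the 2s delay come from the description; `readIntEnv` and `withRetries` are hypothetical helper names, and the way the final error is assembled is an assumption.

```typescript
// Hypothetical helpers for illustration; not the functions added by this PR.

type Env = Record<string, string | undefined>;

// Read a non-negative integer setting, falling back when unset or invalid.
function readIntEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

// Run `attempt` once, then retry up to `maxRetries` more times.
// With maxRetries = 2 this gives at most 3 attempts in total.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxRetries: number,
  delayMs: number = 2000,
): Promise<T> {
  const failures: string[] = [];
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`attempt ${i + 1}: ${message}`);
      // Wait before the next attempt, but not after the final failure.
      if (i < maxRetries) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: report each one so the caller can see why.
  throw new Error(
    `All ${maxRetries + 1} attempts failed:\n${failures.join("\n")}`,
  );
}

// Assumed usage in a Bun or Node process:
//   const timeoutMs = readIntEnv(process.env, "CALLME_CONNECTION_TIMEOUT_MS", 30_000);
//   const maxRetries = readIntEnv(process.env, "CALLME_MAX_RETRIES", 2);
//   await withRetries(() => initiateCall(...), maxRetries);
```

Passing the environment in as a parameter keeps the helper testable without touching the real process environment.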

Notes

  • Rebased onto current main (post-v1.0.3: Kokoro TTS, server resilience, plugin fixes #34); re-expressed inside the new startServer() promise structure and async shutdown().
  • With env vars unset, the only behavior changes are the new defaults: a 30s connection timeout and up to 2 automatic retries.
  • Verified: bun build src/index.ts --target=bun clean; server boots and binds normally.

🤖 Generated with Claude Code

…ting

- Increase default connection timeout from 15s to 30s, configurable via
  CALLME_CONNECTION_TIMEOUT_MS environment variable
- Add automatic retry logic: CALLME_MAX_RETRIES (default 2) wraps
  initiateCall so transient failures get up to 3 total attempts with a
  2s delay between retries
- Add /diagnostics HTTP endpoint returning server health as JSON
- Add get_diagnostics MCP tool so Claude can self-diagnose call issues
- Add DiagnosticEvent tracking for call lifecycle events (initiated,
  connected, failed, ended) with a rolling 50-event buffer
- Improve Telnyx error parsing: JSON error bodies are unpacked to
  human-readable title/detail strings instead of raw JSON blobs
- Add generateFailureDiagnostics: after all retries are exhausted,
  produce a structured error with attempt summary and remediation hints
- Improve waitForConnection: periodic 5s status logging shows WebSocket
  and stream readiness state; timeout error names the specific missing
  component rather than the generic "WebSocket connection timeout"
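
As an illustration of two items in the list above, the rolling event buffer and the provider error unpacking could look roughly like the sketch below. It is not the code in this commit. The `DiagnosticEvent` field names, the `EventBuffer` class, and `describeProviderError` are hypothetical, and the `{ "errors": [{ "title", "detail" }] }` body shape is an assumption about the provider's JSON. Only the four lifecycle kinds, the 50-event limit, and the title/detail fields come from the commit message.

```typescript
// Hypothetical shapes and names; the real DiagnosticEvent may differ.
type LifecycleKind = "initiated" | "connected" | "failed" | "ended";

interface DiagnosticEvent {
  timestamp: string;
  kind: LifecycleKind;
  callId: string;
  detail?: string;
}

// Keeps only the most recent `capacity` events (50 by default).
class EventBuffer {
  private readonly capacity: number;
  private events: DiagnosticEvent[] = [];

  constructor(capacity: number = 50) {
    this.capacity = capacity;
  }

  record(event: DiagnosticEvent): void {
    this.events.push(event);
    // Drop the oldest event once the buffer is over capacity.
    if (this.events.length > this.capacity) {
      this.events.shift();
    }
  }

  // Return a copy so callers cannot mutate the internal buffer.
  snapshot(): DiagnosticEvent[] {
    return [...this.events];
  }
}

// Turn a JSON error body into "title: detail" text.
// Falls back to the raw body when it is not JSON or has no usable fields.
function describeProviderError(body: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return body;
  }
  if (typeof parsed !== "object" || parsed === null) return body;

  const errors = (parsed as { errors?: unknown }).errors;
  if (!Array.isArray(errors)) return body;

  const messages = errors
    .map((e: unknown) => {
      const { title, detail } = (e ?? {}) as { title?: unknown; detail?: unknown };
      return [title, detail]
        .filter((p): p is string => typeof p === "string" && p.length > 0)
        .join(": ");
    })
    .filter((m) => m.length > 0);

  return messages.length > 0 ? messages.join("; ") : body;
}
```

Returning the raw body on any parse failure means a non-JSON provider response is still reported verbatim.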

2 participants