Skip to content

feat(entrypoints): add HTTP server sidecar alongside smg gateway#308

Merged
lightseek-bot merged 5 commits into
lightseekorg:mainfrom
qywu:feat/http-server-squashed
Jun 1, 2026
Merged

feat(entrypoints): add HTTP server sidecar alongside smg gateway#308
lightseek-bot merged 5 commits into
lightseekorg:mainfrom
qywu:feat/http-server-squashed

Conversation

@qywu
Copy link
Copy Markdown
Collaborator

@qywu qywu commented May 29, 2026

Summary

Adds an HTTP server that starts automatically with tokenspeed serve on main_port + 1 (override with --control-port). It runs in a daemon thread alongside the smg gateway and gives engine control + generation a single HTTP surface.

Squashed replacement for #305 (single commit + comprehensive regression tests).

Routing:

Client ──► http_server :8001
             ├─ /get_server_info, /get_model_info, /health_check,
             │  /abort                        ──► gRPC engine (direct)
             └─ /health, /generate, /v1/completions, /v1/chat/completions,
                /v1/models, /v1/messages, /v1/responses, /flush_cache,
                /start_profile, /stop_profile ──► smg gateway ──► engine

Usage

# Auto-starts on port+1, no flag needed
tokenspeed serve --model <path> --port 8000
# → smg gateway on :8000, HTTP sidecar on :8001

# Override the sidecar port
tokenspeed serve --model <path> --port 8000 --control-port 9000

Endpoints

Method Path Backend
GET /health smg passthrough (real engine health)
GET /get_server_info gRPC direct
GET /get_model_info gRPC direct
GET /health_check gRPC direct
POST /abort gRPC direct
GET/POST /generate smg passthrough
POST /v1/completions smg passthrough (streaming)
POST /v1/chat/completions smg passthrough (streaming)
GET /v1/models smg passthrough
POST /v1/messages smg passthrough (Anthropic API)
POST /v1/responses smg passthrough (OpenAI Responses API)
POST /flush_cache smg passthrough
GET/POST /start_profile, /stop_profile smg passthrough

Implementation notes

  • gRPC channel reuse — direct calls share one lazily-created channel/stub; channels are expensive and must not be created per request (fd/socket leak).
  • gRPC error mapping — engine RPC failures return a clean 503 {"error": "engine unavailable"} instead of an unhandled 500.
  • Streaming session lifetime — the upstream aiohttp session outlives the StreamingResponse and is closed in the stream generator's finally; closing it eagerly (e.g. async with) truncates the stream mid-flight.
  • Faithful non-streaming body — relays the raw upstream bytes via Response; JSONResponse(str) would double-encode the already-serialized JSON.

Tests

test/runtime/test_http_server.py — runs entirely on mocks (no engine/smg/GPU), 12 tests covering each of the above as regression guards:

  • Streaming uses a deliberately slow mock (sleeps between chunks). The original fast mock buffered the whole body before the session closed and hid the bug — this one fails against the eager-close version and passes with the fix (verified by temporarily reverting).
  • Non-streaming asserts the body is byte-faithful JSON (not a quoted string).
  • gRPC stub identity (reuse), gRPC→503 mapping, plain-text /health passthrough, status-code relay, and --control-port parsing.

Manual verification

Verified end-to-end against a live tokenspeed serve (Qwen2.5-0.5B):

  • :8001/healthOK (proxied from smg)
  • streaming /v1/completions → SSE tokens stream incrementally with data: [DONE]
  • non-streaming /v1/completions → correct JSON, 200
  • /get_server_info, /get_model_info (gRPC direct) → full JSON
  • gRPC failure → clean 503

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22a594cb85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/cli/serve_smg.py Outdated
Comment on lines +430 to +432
sys.stdout.write(
f"ts control server ready on http://{user_host}:{control_port}\n"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't report control server ready before bind succeeds

When the default user_port + 1 (or an explicit --control-port) is already in use, or when a client connects immediately after this line, _start_control_server() has only spawned a daemon thread and uvicorn.run() may not have bound yet or may fail entirely inside that thread. The orchestrator still prints this ready message and continues running, so automation that waits for this line can proceed with dead/unreachable control endpoints. Please wait for the sidecar to accept connections (or surface bind failure) before emitting the ready message.

Useful? React with 👍 / 👎.

qywu added a commit to qywu/tokenspeed that referenced this pull request May 29, 2026
Address review feedback on lightseekorg#305/lightseekorg#308: the orchestrator printed
"ts control server ready" right after spawning the uvicorn thread, before
the socket was bound. If the port was in use the thread died silently and
automation waiting on that line would hit a dead endpoint.

http_server.build_server() now returns an unstarted uvicorn.Server, and
_start_control_server() polls server.started (uvicorn sets it only after
the socket binds), returning False if the thread dies or times out. The
ready line is gated on success; a bind failure prints a WARNING and serving
continues (the smg gateway is independent).

Tests cover both the ready-after-bind and port-in-use paths.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the feat/http-server-squashed branch from f660ae7 to b40dd34 Compare May 29, 2026 06:04
@qywu
Copy link
Copy Markdown
Collaborator Author

qywu commented May 29, 2026

Addressed in b40dd34 (force-pushed after the reviewed commit 22a594c): _start_control_server is now async and polls uvicorn.Server.started — it only reports ts control server ready once the socket is bound, and prints a WARNING (serving continues) if the port is in use or it times out. Added regression tests for both the ready-after-bind and port-in-use paths.

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b40dd34104

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

url=url,
headers=headers,
data=body,
timeout=aiohttp.ClientTimeout(total=600),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid cutting off long streaming requests

When a sidecar request is a long generation stream, aiohttp.ClientTimeout(total=600) applies to the whole upstream request/response, not just idle reads, so an otherwise healthy /generate or /v1/* stream is dropped after ten minutes of token output. The smg gateway can still serve longer generations directly, but the newly advertised sidecar endpoints fail for long-context or slow-decode jobs; use no total cap or a much larger/idle-read timeout for streaming passthrough.

Useful? React with 👍 / 👎.

Adds an HTTP server that starts automatically with `tokenspeed serve` on
`main_port + 1` (override with `--control-port`). It runs in a daemon
thread next to the smg gateway and gives engine control + generation a
single HTTP surface.

Routing:
  Client ──► http_server :8001
               ├─ /get_server_info, /get_model_info, /health_check,
               │  /abort                        ──► gRPC engine (direct)
               └─ /health, /generate, /v1/completions, /v1/chat/completions,
                  /v1/models, /v1/messages, /v1/responses, /flush_cache,
                  /start_profile, /stop_profile ──► smg gateway ──► engine

Notes:
- gRPC-direct calls share a single lazily-created channel/stub (channels
  are expensive and must be reused, not created per request).
- gRPC errors map to a clean 503 instead of an unhandled 500.
- Generation passthrough preserves SSE streaming. The upstream aiohttp
  session outlives the response and is closed in the stream generator's
  finally (closing it eagerly would truncate the stream mid-flight).
- Non-streaming passthrough relays the raw upstream body (a JSONResponse
  wrapping the already-serialized string would double-encode it).
- The orchestrator waits for the sidecar socket to bind (uvicorn
  server.started) before printing "ready"; a bind failure (e.g. port in
  use) prints a WARNING and serving continues, since the smg gateway is
  independent.

Tests (test/runtime/test_http_server.py) run fully on mocks (no engine,
smg, or GPU) and act as regression guards for each behavior above,
including a deliberately slow streaming mock that fails against the
eager-close bug and a port-in-use check for the bind-readiness path.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the feat/http-server-squashed branch from b40dd34 to d8e71f0 Compare May 29, 2026 06:31
@qywu
Copy link
Copy Markdown
Collaborator Author

qywu commented May 29, 2026

Addressed the streaming-cutoff P2: replaced ClientTimeout(total=600) with ClientTimeout(total=None, sock_connect=30, sock_read=600). Dropping the wall-clock total cap means a long but actively-streaming generation is never aborted mid-flight; sock_read still bounds inactivity so a genuinely hung upstream is detected. Added a regression test asserting total is None with the connect/read bounds set.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ceac64e585

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +24 to +25
from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 as pb
from smg_grpc_proto.generated import tokenspeed_scheduler_pb2_grpc as pb_grpc
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use the packaged smg_grpc_proto import path

In environments using the existing tokenspeed-smg-grpc-proto layout, the generated modules are imported directly from smg_grpc_proto (as the repo's fake engine and integration tests do), and there is no other repo reference to a smg_grpc_proto.generated package. When ts serve reaches _start_control_server(), importing build_server will raise ModuleNotFoundError and abort the already-started gateway/engine instead of launching the sidecar. Import the proto modules from the package path used elsewhere (from smg_grpc_proto import ...) or keep a compatibility fallback.

Useful? React with 👍 / 👎.

Comment thread python/tokenspeed/cli/serve_smg.py Outdated
f"serving continues without it\n"
)
sys.stdout.flush()
sys.stderr.flush()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those two flush necessary?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stdout one yes — it's block-buffered when piped/redirected (normal deploy), and we block on asyncio.wait() right after, so without it the readiness line never hits the logs. Mirrors the ts serve ready flush above. stderr flush is redundant (stderr is line-buffered) — dropping it.

Comment thread python/tokenspeed/cli/serve_smg.py Outdated

# uvicorn sets `started = True` only after the socket is bound and serving.
loop = asyncio.get_running_loop()
deadline = loop.time() + timeout
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a logging for hong long until the server is ready?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added logger.info("control server bound in %.2fs", ...) on the success path. Gives us signal when the bind is slow or creeping toward the 30s timeout (port contention / loaded host).

Copy link
Copy Markdown
Collaborator

@HJSang HJSang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments.

qywu added 2 commits June 1, 2026 18:52
Address review on lightseekorg#308:
- stderr is line-buffered, so the explicit sys.stderr.flush() after the
  warning write is a no-op; remove it. The sys.stdout.flush() stays since
  stdout is block-buffered when piped/redirected and the process blocks on
  asyncio.wait() immediately after.
- Log how long the control server takes to bind, for visibility into slow
  or near-timeout startups.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu
Copy link
Copy Markdown
Collaborator Author

qywu commented Jun 1, 2026

LGTM with minor comments.

thanks! I have addressed the comments and will merge after checks are passed

@lightseek-bot lightseek-bot merged commit ff61329 into lightseekorg:main Jun 1, 2026
28 of 31 checks passed
@qywu qywu deleted the feat/http-server-squashed branch June 1, 2026 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants