feat(entrypoints): add HTTP server sidecar alongside smg gateway by qywu · Pull Request #308 · lightseekorg/tokenspeed

qywu · 2026-05-29T05:27:03Z

Summary

Adds an HTTP server that starts automatically with tokenspeed serve on main_port + 1 (override with --control-port). It runs in a daemon thread alongside the smg gateway and gives engine control + generation a single HTTP surface.

Squashed replacement for #305 (single commit + comprehensive regression tests).

Routing:

Client ──► http_server :8001
             ├─ /get_server_info, /get_model_info, /health_check,
             │  /abort                        ──► gRPC engine (direct)
             └─ /health, /generate, /v1/completions, /v1/chat/completions,
                /v1/models, /v1/messages, /v1/responses, /flush_cache,
                /start_profile, /stop_profile ──► smg gateway ──► engine

Usage

# Auto-starts on port+1, no flag needed
tokenspeed serve --model <path> --port 8000
# → smg gateway on :8000, HTTP sidecar on :8001

# Override the sidecar port
tokenspeed serve --model <path> --port 8000 --control-port 9000
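
The port defaulting described above can be sketched with `argparse`. This is a minimal illustration, not the PR's actual argument parser: the function name `parse_ports` and the parser layout are assumptions, and only the `--port` and `--control-port` flags come from the description.

```python
import argparse


def parse_ports(argv):
    """Return (main_port, control_port) for a serve-style command line."""
    parser = argparse.ArgumentParser(prog="serve-sketch")  # hypothetical parser
    parser.add_argument("--port", type=int, default=8000)
    # None means "flag not given", so the default can be derived from --port.
    parser.add_argument("--control-port", type=int, default=None)
    args = parser.parse_args(argv)

    if args.control_port is not None:
        control_port = args.control_port
    else:
        control_port = args.port + 1  # sidecar defaults to main_port + 1

    if not 0 < control_port < 65536:
        parser.error(f"control port {control_port} is out of range")
    return args.port, control_port


print(parse_ports(["--port", "8000"]))
print(parse_ports(["--port", "8000", "--control-port", "9000"]))
```

Using `None` as the flag's default keeps "not specified" distinguishable from any explicit port value.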

Endpoints

| Method | Path | Backend |
| --- | --- | --- |
| GET | `/health` | smg passthrough (real engine health) |
| GET | `/get_server_info` | gRPC direct |
| GET | `/get_model_info` | gRPC direct |
| GET | `/health_check` | gRPC direct |
| POST | `/abort` | gRPC direct |
| GET/POST | `/generate` | smg passthrough |
| POST | `/v1/completions` | smg passthrough (streaming) |
| POST | `/v1/chat/completions` | smg passthrough (streaming) |
| GET | `/v1/models` | smg passthrough |
| POST | `/v1/messages` | smg passthrough (Anthropic API) |
| POST | `/v1/responses` | smg passthrough (OpenAI Responses API) |
| POST | `/flush_cache` | smg passthrough |
| GET/POST | `/start_profile`, `/stop_profile` | smg passthrough |
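
The split between the two backends can be expressed as a small lookup. The sketch below only restates the endpoint table as data; the names `GRPC_DIRECT`, `SMG_PASSTHROUGH`, and `backend_for` are hypothetical and do not come from the PR.

```python
# Paths answered by calling the engine over gRPC directly (per the table above).
GRPC_DIRECT = {"/get_server_info", "/get_model_info", "/health_check", "/abort"}

# Paths relayed to the smg gateway (per the table above).
SMG_PASSTHROUGH = {
    "/health", "/generate", "/v1/completions", "/v1/chat/completions",
    "/v1/models", "/v1/messages", "/v1/responses", "/flush_cache",
    "/start_profile", "/stop_profile",
}


def backend_for(path):
    """Return "grpc", "smg", or None for a path the sidecar does not serve."""
    if path in GRPC_DIRECT:
        return "grpc"
    if path in SMG_PASSTHROUGH:
        return "smg"
    return None


print(backend_for("/abort"), backend_for("/v1/models"), backend_for("/unknown"))
```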

Implementation notes

gRPC channel reuse — direct calls share one lazily-created channel/stub; channels are expensive and must not be created per request (fd/socket leak).
gRPC error mapping — engine RPC failures return a clean 503 {"error": "engine unavailable"} instead of an unhandled 500.
Streaming session lifetime — the upstream aiohttp session outlives the StreamingResponse and is closed in the stream generator's finally; closing it eagerly (e.g. async with) truncates the stream mid-flight.
Faithful non-streaming body — relays the raw upstream bytes via Response; JSONResponse(str) would double-encode the already-serialized JSON.
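
The first two notes (one shared lazily-created stub, and RPC failures mapped to a 503) can be sketched without a gRPC dependency. Everything here is a stand-in: `EngineUnavailable` plays the role that `grpc.RpcError` would play in a real client, and `LazyStub`, `call_engine`, and `FakeStub` are hypothetical names, not the PR's code.

```python
import threading


class EngineUnavailable(Exception):
    """Stand-in for a transport error (grpc.RpcError in a real gRPC client)."""


class LazyStub:
    """Create the expensive stub once, on first use, and reuse it afterwards."""

    def __init__(self, factory):
        self._factory = factory
        self._stub = None
        self._lock = threading.Lock()

    def get(self):
        if self._stub is None:
            with self._lock:
                # Re-check under the lock so concurrent callers build only one.
                if self._stub is None:
                    self._stub = self._factory()
        return self._stub


def call_engine(lazy_stub, method, *args):
    """Return (status, body); engine failures become 503 instead of a 500."""
    try:
        return 200, getattr(lazy_stub.get(), method)(*args)
    except EngineUnavailable:
        return 503, {"error": "engine unavailable"}


class FakeStub:
    created = 0  # counts how many stubs were ever constructed

    def __init__(self):
        FakeStub.created += 1

    def get_model_info(self):
        return {"model": "demo"}

    def abort(self, request_id):
        raise EngineUnavailable("connection refused")


stub = LazyStub(FakeStub)
ok_result = call_engine(stub, "get_model_info")
failed_result = call_engine(stub, "abort", "req-1")
print(ok_result, failed_result, FakeStub.created)
```

Both calls go through the same stub instance, so the creation counter stays at one regardless of how many requests are served.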

Tests

test/runtime/test_http_server.py — runs entirely on mocks (no engine/smg/GPU), 12 tests covering each of the above as regression guards:

Streaming uses a deliberately slow mock (sleeps between chunks). The original fast mock buffered the whole body before the session closed and hid the bug — this one fails against the eager-close version and passes with the fix (verified by temporarily reverting).
Non-streaming asserts the body is byte-faithful JSON (not a quoted string).
gRPC stub identity (reuse), gRPC→503 mapping, plain-text /health passthrough, status-code relay, and --control-port parsing.
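
The reason a slow mock exposes the eager-close bug can be reproduced with plain `asyncio`. This is a simplified model, not the PR's test: `FakeSession` is a hypothetical stand-in for the upstream aiohttp session, and the handlers return a bare async generator where the real code would return a streaming response.

```python
import asyncio


class FakeSession:
    """Stand-in for an upstream HTTP session that yields chunks slowly."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

    async def chunks(self):
        for part in (b"data: a\n\n", b"data: b\n\n", b"data: [DONE]\n\n"):
            await asyncio.sleep(0.01)  # the deliberate slowness
            if self.closed:
                raise ConnectionError("session closed mid-stream")
            yield part


async def eager_close_handler():
    session = FakeSession()
    try:
        return session.chunks()  # handed back before anything was read
    finally:
        await session.close()  # bug: closes before the consumer iterates


async def finally_close_handler():
    session = FakeSession()

    async def body():
        try:
            async for part in session.chunks():
                yield part
        finally:
            await session.close()  # closed only once streaming has ended

    return body(), session


async def drain(stream):
    received = []
    try:
        async for part in stream:
            received.append(part)
    except ConnectionError:
        pass  # the truncated stream simply ends early for the client
    return received


async def main():
    broken = await drain(await eager_close_handler())
    stream, session = await finally_close_handler()
    fixed = await drain(stream)
    return broken, fixed, session.closed


broken, fixed, closed_after = asyncio.run(main())
print(len(broken), len(fixed), closed_after)
```

A mock that produced every chunk without awaiting would hide the difference, which matches the description of the original fast mock.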

Manual verification

Verified end-to-end against a live tokenspeed serve (Qwen2.5-0.5B):

:8001/health → OK (proxied from smg)
streaming /v1/completions → SSE tokens stream incrementally with data: [DONE]
non-streaming /v1/completions → correct JSON, 200
/get_server_info, /get_model_info (gRPC direct) → full JSON
gRPC failure → clean 503
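
The double-encoding pitfall behind the "faithful non-streaming body" note can be shown with the standard library alone. In this sketch `json.dumps` stands in for what a JSON response class does when it is handed an already-serialized string; the payload is made up for illustration.

```python
import json

# Bytes as they might arrive from the upstream gateway (illustrative payload).
upstream_body = b'{"id": "cmpl-1", "choices": []}'

# Serializing the already-serialized text wraps it in a JSON string literal.
double_encoded = json.dumps(upstream_body.decode("utf-8")).encode("utf-8")

# Relaying the raw bytes leaves the document untouched.
faithful = upstream_body

print(type(json.loads(double_encoded)).__name__)  # str
print(type(json.loads(faithful)).__name__)        # dict
```

A client decoding the double-encoded body gets a string that it would have to decode a second time, which is what the byte-faithful assertion in the tests guards against.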

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22a594cb85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T05:30:22Z

+        sys.stdout.write(
+            f"ts control server ready on http://{user_host}:{control_port}\n"
+        )


Don't report control server ready before bind succeeds

When the default user_port + 1 (or an explicit --control-port) is already in use, or when a client connects immediately after this line, _start_control_server() has only spawned a daemon thread and uvicorn.run() may not have bound yet or may fail entirely inside that thread. The orchestrator still prints this ready message and continues running, so automation that waits for this line can proceed with dead/unreachable control endpoints. Please wait for the sidecar to accept connections (or surface bind failure) before emitting the ready message.

Useful? React with 👍 / 👎.

Address review feedback on lightseekorg#305/lightseekorg#308: the orchestrator printed "ts control server ready" right after spawning the uvicorn thread, before the socket was bound. If the port was in use the thread died silently and automation waiting on that line would hit a dead endpoint.

http_server.build_server() now returns an unstarted uvicorn.Server, and _start_control_server() polls server.started (uvicorn sets it only after the socket binds), returning False if the thread dies or times out. The ready line is gated on success; a bind failure prints a WARNING and serving continues (the smg gateway is independent). Tests cover both the ready-after-bind and port-in-use paths.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu · 2026-05-29T06:08:04Z

Addressed in b40dd34 (force-pushed after the reviewed commit 22a594c): _start_control_server is now async and polls uvicorn.Server.started — it only reports ts control server ready once the socket is bound, and prints a WARNING (serving continues) if the port is in use or it times out. Added regression tests for both the ready-after-bind and port-in-use paths.

@codex review
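
The readiness gate described in this reply can be sketched as a deadline-bounded poll. This is an illustration under assumptions: `FakeServer` is a hypothetical stand-in for a server object exposing a `started` flag (the comment names `uvicorn.Server.started` as the real one), and the function names and return shape are invented for the example.

```python
import asyncio
import threading
import time


class FakeServer:
    """Stand-in for a server whose worker thread sets `started` after binding."""

    def __init__(self, bind_delay, fail=False):
        self.started = False
        self._bind_delay = bind_delay
        self._fail = fail

    def run(self):
        time.sleep(self._bind_delay)
        if self._fail:
            return  # simulated bind failure: thread exits, flag never set
        self.started = True
        time.sleep(0.2)  # keep "serving" briefly so the thread stays alive


async def wait_until_started(server, thread, timeout, interval=0.01):
    """Return (ok, elapsed_seconds); ok is False on thread death or timeout."""
    loop = asyncio.get_running_loop()
    begin = loop.time()
    deadline = begin + timeout
    while loop.time() < deadline:
        if server.started:
            return True, loop.time() - begin
        if not thread.is_alive():
            return False, loop.time() - begin
        await asyncio.sleep(interval)
    return False, loop.time() - begin


async def start(server, timeout=1.0):
    thread = threading.Thread(target=server.run, daemon=True)
    thread.start()
    return await wait_until_started(server, thread, timeout)


ok, elapsed = asyncio.run(start(FakeServer(bind_delay=0.05)))
failed, _ = asyncio.run(start(FakeServer(bind_delay=0.05, fail=True)))
print(f"bound={ok} in {elapsed:.2f}s, failed_case_ok={failed}")
```

Returning the elapsed time alongside the result makes it cheap to log how long the bind took, and checking `is_alive()` lets a bind failure surface well before the timeout expires.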

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b40dd34104

chatgpt-codex-connector · 2026-05-29T06:11:38Z

+            url=url,
+            headers=headers,
+            data=body,
+            timeout=aiohttp.ClientTimeout(total=600),


Avoid cutting off long streaming requests

When a sidecar request is a long generation stream, aiohttp.ClientTimeout(total=600) applies to the whole upstream request/response, not just idle reads, so an otherwise healthy /generate or /v1/* stream is dropped after ten minutes of token output. The smg gateway can still serve longer generations directly, but the newly advertised sidecar endpoints fail for long-context or slow-decode jobs; use no total cap or a much larger/idle-read timeout for streaming passthrough.

Adds an HTTP server that starts automatically with `tokenspeed serve` on `main_port + 1` (override with `--control-port`). It runs in a daemon thread next to the smg gateway and gives engine control + generation a single HTTP surface.

Routing:

Client ──► http_server :8001
             ├─ /get_server_info, /get_model_info, /health_check,
             │  /abort                        ──► gRPC engine (direct)
             └─ /health, /generate, /v1/completions, /v1/chat/completions,
                /v1/models, /v1/messages, /v1/responses, /flush_cache,
                /start_profile, /stop_profile ──► smg gateway ──► engine

Notes:
- gRPC-direct calls share a single lazily-created channel/stub (channels are expensive and must be reused, not created per request).
- gRPC errors map to a clean 503 instead of an unhandled 500.
- Generation passthrough preserves SSE streaming. The upstream aiohttp session outlives the response and is closed in the stream generator's finally (closing it eagerly would truncate the stream mid-flight).
- Non-streaming passthrough relays the raw upstream body (a JSONResponse wrapping the already-serialized string would double-encode it).
- The orchestrator waits for the sidecar socket to bind (uvicorn server.started) before printing "ready"; a bind failure (e.g. port in use) prints a WARNING and serving continues, since the smg gateway is independent.

Tests (test/runtime/test_http_server.py) run fully on mocks (no engine, smg, or GPU) and act as regression guards for each behavior above, including a deliberately slow streaming mock that fails against the eager-close bug and a port-in-use check for the bind-readiness path.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu · 2026-05-29T06:32:05Z

Addressed the streaming-cutoff P2: replaced ClientTimeout(total=600) with ClientTimeout(total=None, sock_connect=30, sock_read=600). Dropping the wall-clock total cap means a long but actively-streaming generation is never aborted mid-flight; sock_read still bounds inactivity so a genuinely hung upstream is detected. Added a regression test asserting total is None with the connect/read bounds set.
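
The difference between a wall-clock cap and an idle-read bound can be illustrated with plain `asyncio`, independent of aiohttp. The sketch below is an analogy for `total` versus `sock_read`, not the PR's code; the helper names and the timings are made up for the demonstration.

```python
import asyncio


async def slow_tokens(count, gap):
    """A healthy stream: one token every `gap` seconds."""
    for i in range(count):
        await asyncio.sleep(gap)
        yield f"tok{i}"


async def read_with_total_cap(stream, total):
    """Bound the whole stream: an active stream is still cut at the deadline."""
    received = []

    async def consume():
        async for item in stream:
            received.append(item)

    try:
        await asyncio.wait_for(consume(), timeout=total)
    except asyncio.TimeoutError:
        pass
    return received


async def read_with_idle_cap(stream, idle):
    """Bound only the wait for each chunk: total duration is unlimited."""
    received = []
    iterator = stream.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle)
        except StopAsyncIteration:
            break  # stream finished normally
        except asyncio.TimeoutError:
            break  # upstream went silent for longer than `idle`
        received.append(item)
    return received


async def main():
    # The stream lasts about 0.3 s in total, with 0.03 s between tokens.
    capped = await read_with_total_cap(slow_tokens(10, 0.03), total=0.1)
    idle = await read_with_idle_cap(slow_tokens(10, 0.03), idle=0.5)
    return capped, idle


capped, idle = asyncio.run(main())
print(len(capped), len(idle))
```

The total cap truncates the stream even though tokens keep arriving, while the idle bound lets it finish and would still trip if the upstream stopped sending.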

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ceac64e585

chatgpt-codex-connector · 2026-05-29T06:45:14Z

+from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 as pb
+from smg_grpc_proto.generated import tokenspeed_scheduler_pb2_grpc as pb_grpc


Use the packaged smg_grpc_proto import path

In environments using the existing tokenspeed-smg-grpc-proto layout, the generated modules are imported directly from smg_grpc_proto (as the repo's fake engine and integration tests do), and there is no other repo reference to a smg_grpc_proto.generated package. When ts serve reaches _start_control_server(), importing build_server will raise ModuleNotFoundError and abort the already-started gateway/engine instead of launching the sidecar. Import the proto modules from the package path used elsewhere (from smg_grpc_proto import ...) or keep a compatibility fallback.
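
A compatibility fallback of the kind suggested here could look like the following sketch. The helper `import_first` is hypothetical, and the demonstration uses standard-library module names so that it runs anywhere; which `smg_grpc_proto` path actually exists depends on the installed package and is not asserted here.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in `module_names` that exists."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError as exc:
            last_error = exc  # remember why the candidate failed, try the next
    raise ModuleNotFoundError(
        f"none of {module_names} could be imported"
    ) from last_error


# Demonstration with stdlib names: the first candidate does not exist,
# so the helper falls back to the second one.
decoder = import_first("json_not_a_real_pkg.generated.decoder", "json.decoder")
print(decoder.__name__)
```

In the sidecar, the candidates would be the two proto module paths discussed in this comment, tried in whichever order the maintainers prefer.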

HJSang · 2026-06-01T18:14:46Z

+                f"serving continues without it\n"
+            )
+        sys.stdout.flush()
+        sys.stderr.flush()


Are those two flushes necessary?

stdout one yes — it's block-buffered when piped/redirected (normal deploy), and we block on asyncio.wait() right after, so without it the readiness line never hits the logs. Mirrors the ts serve ready flush above. stderr flush is redundant (stderr is line-buffered) — dropping it.
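
The buffering behaviour behind this answer can be modelled with `io` wrappers over an in-memory buffer. This is a simulation under assumptions: the `make_stream` helper is hypothetical, and the two wrappers only imitate a block-buffered stdout (as when piped) and a line-buffered stderr rather than touching the real streams.

```python
import io


def make_stream(line_buffering):
    """Return (raw, text): a text stream layered over an in-memory sink."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(
        io.BufferedWriter(raw),
        encoding="utf-8",
        newline="\n",
        line_buffering=line_buffering,
    )
    return raw, text


# Block-buffered, like stdout when it is piped or redirected.
raw_block, block = make_stream(line_buffering=False)
block.write("ts control server ready\n")
before_flush = raw_block.getvalue()  # nothing has reached the sink yet
block.flush()
after_flush = raw_block.getvalue()   # the line arrives only after flush()

# Line-buffered, like stderr: the newline itself pushes the data out.
raw_line, line = make_stream(line_buffering=True)
line.write("warning line\n")
line_buffered = raw_line.getvalue()

print(before_flush, after_flush, line_buffered)
```

If the process blocks right after the write, as the reply describes, the block-buffered line would sit in the buffer until something else flushes it.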

HJSang · 2026-06-01T18:31:31Z

+
+    # uvicorn sets `started = True` only after the socket is bound and serving.
+    loop = asyncio.get_running_loop()
+    deadline = loop.time() + timeout


Add logging for how long it takes until the server is ready?

Good call — added logger.info("control server bound in %.2fs", ...) on the success path. Gives us signal when the bind is slow or creeping toward the 30s timeout (port contention / loaded host).

HJSang

LGTM with minor comments.

Address review on lightseekorg#308:
- stderr is line-buffered, so the explicit sys.stderr.flush() after the warning write is a no-op; remove it. The sys.stdout.flush() stays since stdout is block-buffered when piped/redirected and the process blocks on asyncio.wait() immediately after.
- Log how long the control server takes to bind, for visibility into slow or near-timeout startups.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu · 2026-06-01T18:59:14Z

LGTM with minor comments.

Thanks! I have addressed the comments and will merge once the checks pass.

qywu requested a review from a team as a code owner May 29, 2026 05:27

qywu mentioned this pull request May 29, 2026

feat(entrypoints): add control-plane HTTP server sidecar (--control-port) #305

Closed

chatgpt-codex-connector Bot reviewed May 29, 2026

qywu force-pushed the feat/http-server-squashed branch from f660ae7 to b40dd34 Compare May 29, 2026 06:04

chatgpt-codex-connector Bot reviewed May 29, 2026

qywu force-pushed the feat/http-server-squashed branch from b40dd34 to d8e71f0 Compare May 29, 2026 06:31

Merge branch 'main' into feat/http-server-squashed

ceac64e

chatgpt-codex-connector Bot reviewed May 29, 2026

qywu mentioned this pull request Jun 1, 2026

RL Development Plan: Training-Inference Consistency, Weights Sync, API Support qywu/tokenspeed#7

Closed

17 tasks

lightseek-bot requested a review from borontion June 1, 2026 04:31

lightseek-bot assigned borontion Jun 1, 2026

qywu requested a review from HJSang June 1, 2026 06:30

Merge branch 'main' into feat/http-server-squashed

0b6241e

HJSang reviewed Jun 1, 2026

HJSang approved these changes Jun 1, 2026

qywu added 2 commits June 1, 2026 18:52

Merge branch 'main' into feat/http-server-squashed

2da3cde

lightseek-bot merged commit ff61329 into lightseekorg:main Jun 1, 2026
28 of 31 checks passed

qywu deleted the feat/http-server-squashed branch June 1, 2026 21:07
