feat(entrypoints): add HTTP server sidecar alongside smg gateway#308
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22a594cb85
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sys.stdout.write( | ||
| f"ts control server ready on http://{user_host}:{control_port}\n" | ||
| ) |
There was a problem hiding this comment.
Don't report control server ready before bind succeeds
When the default user_port + 1 (or an explicit --control-port) is already in use, or when a client connects immediately after this line, _start_control_server() has only spawned a daemon thread and uvicorn.run() may not have bound yet or may fail entirely inside that thread. The orchestrator still prints this ready message and continues running, so automation that waits for this line can proceed with dead/unreachable control endpoints. Please wait for the sidecar to accept connections (or surface bind failure) before emitting the ready message.
Useful? React with 👍 / 👎.
Address review feedback on lightseekorg#305/lightseekorg#308: the orchestrator printed "ts control server ready" right after spawning the uvicorn thread, before the socket was bound. If the port was in use the thread died silently and automation waiting on that line would hit a dead endpoint. http_server.build_server() now returns an unstarted uvicorn.Server, and _start_control_server() polls server.started (uvicorn sets it only after the socket binds), returning False if the thread dies or times out. The ready line is gated on success; a bind failure prints a WARNING and serving continues (the smg gateway is independent). Tests cover both the ready-after-bind and port-in-use paths. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
f660ae7 to
b40dd34
Compare
|
Addressed in @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b40dd34104
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| url=url, | ||
| headers=headers, | ||
| data=body, | ||
| timeout=aiohttp.ClientTimeout(total=600), |
There was a problem hiding this comment.
Avoid cutting off long streaming requests
When a sidecar request is a long generation stream, aiohttp.ClientTimeout(total=600) applies to the whole upstream request/response, not just idle reads, so an otherwise healthy /generate or /v1/* stream is dropped after ten minutes of token output. The smg gateway can still serve longer generations directly, but the newly advertised sidecar endpoints fail for long-context or slow-decode jobs; use no total cap or a much larger/idle-read timeout for streaming passthrough.
Useful? React with 👍 / 👎.
Adds an HTTP server that starts automatically with `tokenspeed serve` on
`main_port + 1` (override with `--control-port`). It runs in a daemon
thread next to the smg gateway and gives engine control + generation a
single HTTP surface.
Routing:
Client ──► http_server :8001
├─ /get_server_info, /get_model_info, /health_check,
│ /abort ──► gRPC engine (direct)
└─ /health, /generate, /v1/completions, /v1/chat/completions,
/v1/models, /v1/messages, /v1/responses, /flush_cache,
/start_profile, /stop_profile ──► smg gateway ──► engine
Notes:
- gRPC-direct calls share a single lazily-created channel/stub (channels
are expensive and must be reused, not created per request).
- gRPC errors map to a clean 503 instead of an unhandled 500.
- Generation passthrough preserves SSE streaming. The upstream aiohttp
session outlives the response and is closed in the stream generator's
finally (closing it eagerly would truncate the stream mid-flight).
- Non-streaming passthrough relays the raw upstream body (a JSONResponse
wrapping the already-serialized string would double-encode it).
- The orchestrator waits for the sidecar socket to bind (uvicorn
server.started) before printing "ready"; a bind failure (e.g. port in
use) prints a WARNING and serving continues, since the smg gateway is
independent.
Tests (test/runtime/test_http_server.py) run fully on mocks (no engine,
smg, or GPU) and act as regression guards for each behavior above,
including a deliberately slow streaming mock that fails against the
eager-close bug and a port-in-use check for the bind-readiness path.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
b40dd34 to
d8e71f0
Compare
|
Addressed the streaming-cutoff P2: replaced |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ceac64e585
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| from smg_grpc_proto.generated import tokenspeed_scheduler_pb2 as pb | ||
| from smg_grpc_proto.generated import tokenspeed_scheduler_pb2_grpc as pb_grpc |
There was a problem hiding this comment.
Use the packaged smg_grpc_proto import path
In environments using the existing tokenspeed-smg-grpc-proto layout, the generated modules are imported directly from smg_grpc_proto (as the repo's fake engine and integration tests do), and there is no other repo reference to a smg_grpc_proto.generated package. When ts serve reaches _start_control_server(), importing build_server will raise ModuleNotFoundError and abort the already-started gateway/engine instead of launching the sidecar. Import the proto modules from the package path used elsewhere (from smg_grpc_proto import ...) or keep a compatibility fallback.
Useful? React with 👍 / 👎.
| f"serving continues without it\n" | ||
| ) | ||
| sys.stdout.flush() | ||
| sys.stderr.flush() |
There was a problem hiding this comment.
are those two flush necessary?
There was a problem hiding this comment.
stdout one yes — it's block-buffered when piped/redirected (normal deploy), and we block on asyncio.wait() right after, so without it the readiness line never hits the logs. Mirrors the ts serve ready flush above. stderr flush is redundant (stderr is line-buffered) — dropping it.
|
|
||
| # uvicorn sets `started = True` only after the socket is bound and serving. | ||
| loop = asyncio.get_running_loop() | ||
| deadline = loop.time() + timeout |
There was a problem hiding this comment.
Add a logging for hong long until the server is ready?
There was a problem hiding this comment.
Good call — added logger.info("control server bound in %.2fs", ...) on the success path. Gives us signal when the bind is slow or creeping toward the 30s timeout (port contention / loaded host).
HJSang
left a comment
There was a problem hiding this comment.
LGTM with minor comments.
Address review on lightseekorg#308: - stderr is line-buffered, so the explicit sys.stderr.flush() after the warning write is a no-op; remove it. The sys.stdout.flush() stays since stdout is block-buffered when piped/redirected and the process blocks on asyncio.wait() immediately after. - Log how long the control server takes to bind, for visibility into slow or near-timeout startups. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
thanks! I have addressed the comments and will merge after checks are passed |
Summary
Adds an HTTP server that starts automatically with
tokenspeed serveonmain_port + 1(override with--control-port). It runs in a daemon thread alongside the smg gateway and gives engine control + generation a single HTTP surface.Squashed replacement for #305 (single commit + comprehensive regression tests).
Routing:
Usage
Endpoints
/health/get_server_info/get_model_info/health_check/abort/generate/v1/completions/v1/chat/completions/v1/models/v1/messages/v1/responses/flush_cache/start_profile,/stop_profileImplementation notes
503 {"error": "engine unavailable"}instead of an unhandled 500.StreamingResponseand is closed in the stream generator'sfinally; closing it eagerly (e.g.async with) truncates the stream mid-flight.Response;JSONResponse(str)would double-encode the already-serialized JSON.Tests
test/runtime/test_http_server.py— runs entirely on mocks (no engine/smg/GPU), 12 tests covering each of the above as regression guards:/healthpassthrough, status-code relay, and--control-portparsing.Manual verification
Verified end-to-end against a live
tokenspeed serve(Qwen2.5-0.5B)::8001/health→OK(proxied from smg)/v1/completions→ SSE tokens stream incrementally withdata: [DONE]/v1/completions→ correct JSON, 200/get_server_info,/get_model_info(gRPC direct) → full JSON