
feat(entrypoints): add control-plane HTTP server sidecar (--control-port)#305

Closed
qywu wants to merge 17 commits into
lightseekorg:mainfrom
qywu:feat/http-server

Conversation

qywu (Collaborator) commented May 28, 2026

Summary

Adds an HTTP server sidecar that starts automatically alongside `tokenspeed serve` on `main_port + 1` (override with `--control-port`).

Architecture:
```
Client ──► http_server :8001
             ├─ /health, /get_server_info, /get_model_info,
             │  /health_check, /abort ──► gRPC engine (direct)
             └─ /generate, /v1/*, /flush_cache
                  ──► smg gateway :8000 ──► gRPC engine
```

Endpoints

| Method | Path | Backend |
|---|---|---|
| GET | `/health` | local (always 200) |
| GET | `/get_server_info` | gRPC direct |
| GET | `/get_model_info` | gRPC direct |
| GET | `/health_check` | gRPC direct |
| POST | `/abort` | gRPC direct |
| GET/POST | `/generate` | smg passthrough |
| POST | `/v1/completions` | smg passthrough (streaming supported) |
| POST | `/v1/chat/completions` | smg passthrough (streaming supported) |
| GET | `/v1/models` | smg passthrough |
| POST | `/v1/messages` | smg passthrough (Anthropic API) |
| POST | `/v1/responses` | smg passthrough (OpenAI Responses API) |
| POST | `/flush_cache` | smg passthrough |
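
The routing split in the endpoint list can be sketched as a simple path-to-backend lookup. This is an illustration only, not the code in `http_server.py`: `backend_for`, `GRPC_DIRECT`, and `SMG_EXACT` are hypothetical names. The list marks `/health` as local while the diagram groups it with the gRPC-direct calls; the sketch follows the list.

```python
# Hypothetical sketch of the routing split described in the endpoint list.
GRPC_DIRECT = {"/get_server_info", "/get_model_info", "/health_check", "/abort"}
SMG_EXACT = {"/generate", "/flush_cache"}


def backend_for(path: str) -> str:
    """Return which backend would serve `path` according to the endpoint list."""
    if path == "/health":
        return "local"  # answered by the sidecar itself
    if path in GRPC_DIRECT:
        return "grpc"  # sent straight to the engine over gRPC
    if path in SMG_EXACT or path.startswith("/v1/"):
        return "smg"  # forwarded unchanged to the smg gateway
    return "unknown"
```

For example, `backend_for("/v1/chat/completions")` returns `"smg"` and `backend_for("/abort")` returns `"grpc"`.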

Usage

```bash
# Auto-starts on port+1 (no flag needed)
tokenspeed serve --model <path> --port 8000
# → smg gateway on :8000, HTTP server on :8001

# Override control port
tokenspeed serve --model <path> --port 8000 --control-port 9000
```
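
The port defaulting rule above (main port plus one unless `--control-port` is given) could look roughly like the following. `resolve_control_port` is a hypothetical helper, and the range and collision checks are assumptions added for illustration; the PR does not state that it performs them.

```python
from typing import Optional


def resolve_control_port(main_port: int, control_port: Optional[int] = None) -> int:
    """Pick the sidecar port: the explicit override if given, else main_port + 1."""
    port = main_port + 1 if control_port is None else control_port
    # Assumed sanity checks, not taken from the PR.
    if not 0 < port < 65536:
        raise ValueError(f"control port {port} is out of range")
    if port == main_port:
        raise ValueError("control port must differ from the main port")
    return port
```

With `--port 8000` this yields 8001 by default and 9000 when `--control-port 9000` is passed.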

Changes

  • `python/tokenspeed/runtime/entrypoints/http_server.py` — FastAPI server with gRPC direct + smg passthrough
  • `python/tokenspeed/cli/_argsplit.py` — `--control-port` orchestrator flag
  • `python/tokenspeed/cli/serve_smg.py` — auto-start sidecar after smg ready, passing both `gateway_url` and `engine_grpc_addr`
  • `test/runtime/test_http_server.py` — unit tests for `/health` and `--control-port` parsing

Notes

qywu added 2 commits May 28, 2026 22:49
Adds tokenspeed.runtime.entrypoints.http_server — a FastAPI/uvicorn
server that wraps Engine directly, bypassing the smg+gRPC stack.

Useful for:
- RL training: /pause_generation, /continue_generation (PR lightseekorg#270),
  /init_weights_update_group, /update_weights_from_distributed,
  /release_memory_occupation, /resume_memory_occupation
- Benchmarking: direct HTTP access without smg overhead
- Testing: simpler startup, /readiness probe, no smg dependency

Endpoints:
  GET  /health, /readiness, /get_server_info, /v1/models
  POST /generate, /v1/completions, /v1/chat/completions (streaming supported)
  POST /flush_cache, /start_profile, /stop_profile
  POST /pause_generation, /continue_generation (requires PR lightseekorg#270; returns
       501 until merged)
  POST /init_weights_update_group, /update_weights_from_distributed
  POST /release_memory_occupation, /resume_memory_occupation

CLI: `tokenspeed http-server --host 0.0.0.0 --port 8080 --model <path> ...`
     (engine ServerArgs passed through after --host/--port)
Standalone: `tokenspeed-http-server --model <path> ...`
Python API: `from tokenspeed.runtime.entrypoints.http_server import run`

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Adds a lightweight control-plane HTTP server that runs alongside the smg
gateway on a separate port, proxying engine control calls to smg.

Architecture:
  Client (generation)  ──►  smg gateway  :8080  ──►  gRPC engine
  Client (control)     ──►  http_server   :8081  ──►  smg gateway  :8080

Changes:
- python/tokenspeed/runtime/entrypoints/http_server.py: FastAPI server
  that proxies /pause_generation, /continue_generation (PR lightseekorg#270),
  weight-update, /flush_cache, /start_profile, /stop_profile,
  /get_server_info, /health, /readiness to the smg gateway.
- python/tokenspeed/cli/_argsplit.py: adds --control-port to
  _ORCH_FLAGS and OrchestratorOpts.control_port (int | None).
- python/tokenspeed/cli/serve_smg.py: after smg is ready, starts the
  control server in a daemon thread when --control-port is set.
- test/runtime/test_http_server.py: unit tests for the control server
  endpoints and --control-port arg parsing.

Usage:
  tokenspeed serve --model <path> --port 8080 --control-port 8081
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
qywu changed the title from "feat(entrypoints): add lightweight HTTP server (no smg gateway)" to "feat(entrypoints): add control-plane HTTP server sidecar (--control-port)" on May 28, 2026
…equired

Remove the requirement to pass --control-port explicitly. The control HTTP
server now starts automatically as a sidecar on user_port+1 whenever
tokenspeed serve runs. --control-port remains available as an override
for cases where port+1 is taken.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu marked this pull request as ready for review May 29, 2026 00:07
@qywu qywu requested a review from a team as a code owner May 29, 2026 00:07
@qywu qywu marked this pull request as draft May 29, 2026 00:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 584c4dea0c


@app.post("/release_memory_occupation")
async def release_memory_occupation(request: Request):
    return await _proxy("POST", "/release_memory_occupation", await request.json())

[P2] Accept empty bodies for no-argument control calls

For `POST /release_memory_occupation` (and the analogous resume/pause default case), a client that sends the usual no-body control request hits `await request.json()` before the proxy call, so an empty body raises a JSON decode error instead of reaching the gateway. The runtime API for `release_memory_occupation` has no required input fields, so this sidecar should treat an empty body as `{}` or `None` rather than returning a 500 for that valid call pattern.
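
One way to tolerate empty bodies is to read the raw bytes first and decode only when something is present. The helper below is a minimal sketch under that assumption; `parse_optional_json` is a hypothetical name, and in a Starlette or FastAPI handler the input would presumably come from `await request.body()`.

```python
import json
from typing import Any, Optional


def parse_optional_json(raw: bytes) -> Optional[Any]:
    """Decode a request body, treating an empty or whitespace-only body as None."""
    if not raw.strip():
        return None  # no-argument control call: nothing to decode
    return json.loads(raw)  # malformed JSON still raises json.JSONDecodeError
```

A handler could then forward `parse_optional_json(await request.body())` to the proxy call instead of calling `request.json()` directly.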


Comment on lines +417 to +424
_start_control_server(
    gateway_url=f"http://{user_host}:{user_port}",
    host=user_host,
    port=control_port,
)
sys.stdout.write(
    f"ts control server ready on http://{user_host}:{control_port}\n"
)

[P2] Verify the control server starts before declaring readiness

When the requested or default control port is already in use (or uvicorn fails to bind for any other reason), `_start_control_server` only launches a daemon thread and returns immediately, so this code still prints `ts control server ready` and the orchestrator continues with no working control plane. This affects deployments that pass `--control-port` and rely on this readiness line for automation; the startup path should synchronize with uvicorn binding or surface the failure before advertising the sidecar as ready.


Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu force-pushed the feat/http-server branch from 7b63f30 to f7559d3 Compare May 29, 2026 00:12
qywu added 13 commits May 29, 2026 00:14
…readiness only

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…le stubs

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…hCheck, Abort

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…ouble-encoding

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Review fixes for lightseekorg#305:
- Reuse a single shared gRPC channel/stub instead of creating (and
  leaking) a new channel on every /get_server_info, /get_model_info,
  /health_check, /abort call.
- Map grpc.aio.AioRpcError to a clean 503 JSON response instead of an
  unhandled 500 + stack trace.
- Fix stale "Health (local)" comment (now proxied).

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
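
The two review fixes above (one shared channel, RPC errors mapped to a 503) can be sketched without grpcio by using stand-ins. Everything here is hypothetical: `FakeChannel` and `FakeRpcError` take the place of a real gRPC channel and `grpc.aio.AioRpcError`, and `get_channel` and `call_engine` are illustrative names, not functions from the PR.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple


class FakeRpcError(Exception):
    """Stand-in for grpc.aio.AioRpcError so the sketch runs without grpcio."""


class FakeChannel:
    """Stand-in for a gRPC channel; counts how many instances were created."""

    created = 0

    def __init__(self) -> None:
        FakeChannel.created += 1


_channel: Optional[FakeChannel] = None


def get_channel() -> FakeChannel:
    """Create the channel on first use and return the same one afterwards."""
    global _channel
    if _channel is None:
        _channel = FakeChannel()
    return _channel


async def call_engine(
    rpc: Callable[[FakeChannel], Awaitable[Dict[str, Any]]],
) -> Tuple[int, Dict[str, Any]]:
    """Run an RPC on the shared channel; map RPC failures to a 503 payload."""
    try:
        return 200, await rpc(get_channel())
    except FakeRpcError as exc:
        return 503, {"error": str(exc)}
```

Every call goes through `get_channel()`, so repeated requests reuse one channel, and a failing RPC produces a `(503, {...})` pair instead of an unhandled exception.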
The `async with ClientSession()` block closed the session as soon as
_proxy_request returned the StreamingResponse — but FastAPI consumes the
body iterator afterward, so the upstream connection was closed mid-stream
and streaming requests raised "Connection closed." (caught only against a
real engine; a fast mock fully buffers before close and hides it).

Now the session is created without a context manager and closed in the
generator's finally (streaming) or after read() (non-streaming).

Verified against a live `tokenspeed serve` (Qwen2.5-0.5B): SSE tokens
stream incrementally through the sidecar with a proper data: [DONE]
terminator; non-streaming, /health, and gRPC-direct endpoints all 200.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
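
The session-lifetime fix described above can be illustrated with a stand-in session: the body generator owns the session and closes it in `finally`, after the last chunk has been consumed. `FakeSession`, `stream_body`, and `collect` are hypothetical names for this sketch and do not reproduce the PR's `_proxy_request`.

```python
import asyncio
from typing import AsyncIterator, List


class FakeSession:
    """Stand-in for an HTTP client session; chunks can only be read while open."""

    def __init__(self, chunks: List[bytes]) -> None:
        self._chunks = chunks
        self.closed = False

    async def iter_chunks(self) -> AsyncIterator[bytes]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a real network read would
            if self.closed:
                raise ConnectionError("Connection closed.")
            yield chunk

    async def close(self) -> None:
        self.closed = True


def stream_body(session: FakeSession) -> AsyncIterator[bytes]:
    """Return a body iterator that owns the session and closes it when done."""

    async def gen() -> AsyncIterator[bytes]:
        try:
            async for chunk in session.iter_chunks():
                yield chunk
        finally:
            await session.close()  # runs only after the consumer is finished

    return gen()


async def collect(session: FakeSession) -> List[bytes]:
    """Mimic the framework: obtain the iterator first, consume it afterwards."""
    body = stream_body(session)  # the handler has returned; session is still open
    return [chunk async for chunk in body]
```

Had the session been closed before `collect` iterated the body, `iter_chunks` would raise `ConnectionError`, which mirrors the mid-stream failure in the commit message.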
qywu (Collaborator, Author) commented May 29, 2026

Superseded by #308 — same change squashed into a single commit, plus comprehensive regression tests (test/runtime/test_http_server.py) for the streaming session-lifetime, double-encoding, gRPC channel-reuse, and gRPC-error-mapping bugs found during development.

@qywu qywu closed this May 29, 2026
qywu added a commit to qywu/tokenspeed that referenced this pull request May 29, 2026
Address review feedback on lightseekorg#305/lightseekorg#308: the orchestrator printed
"ts control server ready" right after spawning the uvicorn thread, before
the socket was bound. If the port was in use the thread died silently and
automation waiting on that line would hit a dead endpoint.

http_server.build_server() now returns an unstarted uvicorn.Server, and
_start_control_server() polls server.started (uvicorn sets it only after
the socket binds), returning False if the thread dies or times out. The
ready line is gated on success; a bind failure prints a WARNING and serving
continues (the smg gateway is independent).

Tests cover both the ready-after-bind and port-in-use paths.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
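
The readiness gating described in this commit can be sketched with a stand-in server object. `FakeServer` and `start_and_wait` are hypothetical; the real code polls `started` on a `uvicorn.Server`, and the timeout and poll interval below are assumed values.

```python
import threading
import time


class FakeServer:
    """Stand-in for a server object that exposes a `started` flag."""

    def __init__(self, fail: bool = False) -> None:
        self.started = False
        self._fail = fail

    def run(self) -> None:
        if self._fail:
            return  # simulates a bind failure: the thread exits without starting
        self.started = True
        time.sleep(0.2)  # pretend to serve for a moment


def start_and_wait(server: FakeServer, timeout: float = 2.0, poll: float = 0.01) -> bool:
    """Start the server in a daemon thread; return True only once it reports started."""
    thread = threading.Thread(target=server.run, daemon=True)
    thread.start()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if server.started:
            return True
        if not thread.is_alive():
            return server.started  # thread ended; report whatever it reached
        time.sleep(poll)
    return False  # timed out waiting for the bind
```

A caller would print the ready line only when `start_and_wait(...)` returns `True` and emit a warning otherwise.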