Add standalone compute pool service integration#969
Conversation
📊 Code coverageNo coverage components ran for this PR. |
chrono-kw
left a comment
There was a problem hiding this comment.
Review summary: looks good to merge — clean, well-scoped, with a few non-blocking robustness notes
This is purely additive under integrations/compute-pool-service/ (sits alongside integrations/openclaw and integrations/oracle). No NyxID core changes — no backend routes, collections, model changes, or deletions — and CI's Rust/CLI/frontend jobs correctly skip. The architecture decision to keep the compute queue out of core and register it as a normal custom service behind the proxy + Credential Node is the right call, and the queue design cleanly mirrors the Oracle relay (worker-pull, lease/ack/result, client_ref idempotency, retention TTL, metadata-only logging).
Security hygiene is solid: constant-time token compare (hash → timingSafeEqual), separate consumer/worker tokens, startup refuses to run without both unless COMPUTE_POOL_DEV_INSECURE=1, 8 MB body cap, worker-label + capability sanitization, atomic 0600 store writes, no secrets logged, .gitignore excludes .env + the store. The server makes no outbound requests, so there's no SSRF surface; only the worker fetches operator-provided URLs.
Approving in spirit for a v0.1 reference integration. The items below are follow-ups, none blocking.
Minor issues
-
Concurrent
saveStore()can corrupt the store. Every mutating request writes to one fixed${STORE_PATH}.tmpthen renames. Two near-simultaneous requests in the same process can interleave writes to that tmp file; a corrupt tmp gets renamed in, and the nextloadStore()silently falls back to an empty store (total task loss). This is concurrent-request, not just multi-process — a small write queue/mutex aroundsaveStorewould be cheap insurance even for the JSON backend. -
A transient
/worker/ackfailure fails the whole task. InexecuteWithHeartbeats, if an ackpostJsonthrows (network blip), thePromise.racerejects → the task is reportedfailed, and the in-flight local request is not aborted (it runs on, orphaned, wasting compute). Acks should be best-effort / retried rather than fatal. -
Lease vs. request-timeout ordering. Worker
--request-timeout-secsdefaults to 4h while the server leaseCOMPUTE_POOL_TASK_TIMEOUT_SECSdefaults to 2h. It works in practice because each 30s ack refreshes the lease, but if acks are disrupted on a long task the lease can expire → requeue → double-dispatch. Consider making the lease ≥ request timeout, or document the dependency. -
No per-consumer task ownership. Anyone holding the single API token can
GET/cancelanytask_id. Risk is low (v4 UUIDs are unguessable, per-user scoping is NyxID's job upstream), but worth stating that consumers sharing the service token can read each other's tasks by id. -
Tests are syntax-only (
node --check). Fine for a reference integration, but there's no functional coverage of the claim/lease/requeue/cancel state machine — the part most likely to regress. -
Small doc gaps.
openapi.yamlomits the409 task_terminalcancel response (and, by design, the worker endpoints);servers:sayslocalhostwhile the default bind is127.0.0.1.
Items 1 and 2 are the ones I'd most want addressed before this sees real concurrent traffic; the rest are documentation/polish.
|
Addressed the two robustness notes that are most relevant before real concurrent traffic:
Also tightened the docs/spec items:
Leaving functional state-machine tests and durable storage as follow-up work for when this reference integration grows beyond the smoke-test JSON backend. |
Summary
Architecture
NyxID remains the control plane: org/agent auth, service governance, credential injection, node routing, proxying, and audit metadata. Compute-specific task state, worker tokens, local model endpoints, and GPU worker protocol stay in the external service.
This PR is not a NyxID service-pool framework. Cross-service counting, quotas, metering, and load balancing should be handled by a future generic NyxID service-pool design rather than by a compute-specific core API. The compute service exposes /v1/status as a capacity signal that such a layer could use later.
Security boundary
Tests
Notes
This replaces the closed prototype PR #967 direction. The old PR proved the worker-pull UX and safety boundary, but this PR keeps compute as a NyxID-managed external service rather than adding compute as a core backend feature.