Skip to content

Add standalone compute pool service integration#969

Open
AlyciaBHZ wants to merge 3 commits into
ChronoAIProject:mainfrom
AlyciaBHZ:codex/compute-pool-service
Open

Add standalone compute pool service integration#969
AlyciaBHZ wants to merge 3 commits into
ChronoAIProject:mainfrom
AlyciaBHZ:codex/compute-pool-service

Conversation

@AlyciaBHZ

@AlyciaBHZ AlyciaBHZ commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • add a standalone compute-pool-service integration instead of a NyxID core-native compute API
  • provide a reference worker-pull queue service, trusted worker bridge, and OpenAPI spec for the consumer API
  • document how to register the service through existing NyxID service + Credential Node flows

Architecture

NyxID remains the control plane: org/agent auth, service governance, credential injection, node routing, proxying, and audit metadata. Compute-specific task state, worker tokens, local model endpoints, and GPU worker protocol stay in the external service.

This PR is not a NyxID service-pool framework. Cross-service counting, quotas, metering, and load balancing should be handled by a future generic NyxID service-pool design rather than by a compute-specific core API. The compute service exposes /v1/status as a capacity signal that such a layer could use later.

Security boundary

  • no NyxID backend /api/v1/compute routes or compute collections
  • no NyxID org model changes
  • NyxID does not SSH into worker hosts or execute shell commands
  • worker-local endpoint URLs and backend tokens are not stored in NyxID
  • with Credential Node routing, the compute service API token can stay on the node host

Tests

  • node --check integrations/compute-pool-service/server.mjs
  • node --check integrations/compute-pool-service/worker.mjs

Notes

This replaces the closed prototype PR #967 direction. The old PR proved the worker-pull UX and safety boundary, but this PR keeps compute as a NyxID-managed external service rather than adding compute as a core backend feature.

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown

📊 Code coverage

No coverage components ran for this PR.

@chrono-kw chrono-kw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary: looks good to merge — clean, well-scoped, with a few non-blocking robustness notes

This is purely additive under integrations/compute-pool-service/ (sits alongside integrations/openclaw and integrations/oracle). No NyxID core changes — no backend routes, collections, model changes, or deletions — and CI's Rust/CLI/frontend jobs correctly skip. The architecture decision to keep the compute queue out of core and register it as a normal custom service behind the proxy + Credential Node is the right call, and the queue design cleanly mirrors the Oracle relay (worker-pull, lease/ack/result, client_ref idempotency, retention TTL, metadata-only logging).

Security hygiene is solid: constant-time token compare (hashtimingSafeEqual), separate consumer/worker tokens, startup refuses to run without both unless COMPUTE_POOL_DEV_INSECURE=1, 8 MB body cap, worker-label + capability sanitization, atomic 0600 store writes, no secrets logged, .gitignore excludes .env + the store. The server makes no outbound requests, so there's no SSRF surface; only the worker fetches operator-provided URLs.

Approving in spirit for a v0.1 reference integration. The items below are follow-ups, none blocking.

Minor issues

  1. Concurrent saveStore() can corrupt the store. Every mutating request writes to one fixed ${STORE_PATH}.tmp then renames. Two near-simultaneous requests in the same process can interleave writes to that tmp file; a corrupt tmp gets renamed in, and the next loadStore() silently falls back to an empty store (total task loss). This is concurrent-request, not just multi-process — a small write queue/mutex around saveStore would be cheap insurance even for the JSON backend.

  2. A transient /worker/ack failure fails the whole task. In executeWithHeartbeats, if an ack postJson throws (network blip), the Promise.race rejects → the task is reported failed, and the in-flight local request is not aborted (it runs on, orphaned, wasting compute). Acks should be best-effort / retried rather than fatal.

  3. Lease vs. request-timeout ordering. Worker --request-timeout-secs defaults to 4h while the server lease COMPUTE_POOL_TASK_TIMEOUT_SECS defaults to 2h. It works in practice because each 30s ack refreshes the lease, but if acks are disrupted on a long task the lease can expire → requeue → double-dispatch. Consider making the lease ≥ request timeout, or document the dependency.

  4. No per-consumer task ownership. Anyone holding the single API token can GET/cancel any task_id. Risk is low (v4 UUIDs are unguessable, per-user scoping is NyxID's job upstream), but worth stating that consumers sharing the service token can read each other's tasks by id.

  5. Tests are syntax-only (node --check). Fine for a reference integration, but there's no functional coverage of the claim/lease/requeue/cancel state machine — the part most likely to regress.

  6. Small doc gaps. openapi.yaml omits the 409 task_terminal cancel response (and, by design, the worker endpoints); servers: says localhost while the default bind is 127.0.0.1.

Items 1 and 2 are the ones I'd most want addressed before this sees real concurrent traffic; the rest are documentation/polish.

@AlyciaBHZ

Copy link
Copy Markdown
Contributor Author

Addressed the two robustness notes that are most relevant before real concurrent traffic:

  • Serialized JSON store writes through a process-local save queue and unique per-write tmp file, so concurrent requests in one process no longer race on the same tmp path.
  • Made worker ack failures retryable. Transient /worker/ack errors no longer immediately fail the task; after --max-ack-failures consecutive failures, the worker aborts the local request and reports failure so compute is not orphaned.

Also tightened the docs/spec items:

  • Documented the lease/ack relationship and the single consumer token task-visibility boundary.
  • Updated OpenAPI server URL to 127.0.0.1 and added the 409 task_terminal cancel response.
  • Re-ran node --check for server.mjs and worker.mjs, plus git diff --check.

Leaving functional state-machine tests and durable storage as follow-up work for when this reference integration grows beyond the smoke-test JSON backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants