Add standalone compute pool service integration by AlyciaBHZ · Pull Request #969 · ChronoAIProject/NyxID

AlyciaBHZ · 2026-06-17T10:18:29Z

Summary

add a standalone compute-pool-service integration instead of a NyxID core-native compute API
provide a reference worker-pull queue service, trusted worker bridge, and OpenAPI spec for the consumer API
document how to register the service through existing NyxID service + Credential Node flows

Architecture

NyxID remains the control plane: org/agent auth, service governance, credential injection, node routing, proxying, and audit metadata. Compute-specific task state, worker tokens, local model endpoints, and GPU worker protocol stay in the external service.

This PR is not a NyxID service-pool framework. Cross-service counting, quotas, metering, and load balancing should be handled by a future generic NyxID service-pool design rather than by a compute-specific core API. The compute service exposes /v1/status as a capacity signal that such a layer could use later.

Security boundary

no NyxID backend /api/v1/compute routes or compute collections
no NyxID org model changes
NyxID does not SSH into worker hosts or execute shell commands
worker-local endpoint URLs and backend tokens are not stored in NyxID
with Credential Node routing, the compute service API token can stay on the node host

Tests

node --check integrations/compute-pool-service/server.mjs
node --check integrations/compute-pool-service/worker.mjs

Notes

This replaces the closed prototype PR #967 direction. The old PR proved the worker-pull UX and safety boundary, but this PR keeps compute as a NyxID-managed external service rather than adding compute as a core backend feature.

github-actions · 2026-06-17T10:19:19Z

📊 Code coverage

No coverage components ran for this PR.

chrono-kw

Review summary: looks good to merge — clean, well-scoped, with a few non-blocking robustness notes

This is purely additive under integrations/compute-pool-service/ (sits alongside integrations/openclaw and integrations/oracle). No NyxID core changes — no backend routes, collections, model changes, or deletions — and CI's Rust/CLI/frontend jobs correctly skip. The architecture decision to keep the compute queue out of core and register it as a normal custom service behind the proxy + Credential Node is the right call, and the queue design cleanly mirrors the Oracle relay (worker-pull, lease/ack/result, client_ref idempotency, retention TTL, metadata-only logging).

Security hygiene is solid: constant-time token compare (hash → timingSafeEqual), separate consumer/worker tokens, startup refuses to run without both unless COMPUTE_POOL_DEV_INSECURE=1, 8 MB body cap, worker-label + capability sanitization, atomic 0600 store writes, no secrets logged, .gitignore excludes .env + the store. The server makes no outbound requests, so there's no SSRF surface; only the worker fetches operator-provided URLs.

Approving in spirit for a v0.1 reference integration. The items below are follow-ups, none blocking.

Minor issues

Concurrent saveStore() can corrupt the store. Every mutating request writes to one fixed ${STORE_PATH}.tmp then renames. Two near-simultaneous requests in the same process can interleave writes to that tmp file; a corrupt tmp gets renamed in, and the next loadStore() silently falls back to an empty store (total task loss). This is concurrent-request, not just multi-process — a small write queue/mutex around saveStore would be cheap insurance even for the JSON backend.
A transient /worker/ack failure fails the whole task. In executeWithHeartbeats, if an ack postJson throws (network blip), the Promise.race rejects → the task is reported failed, and the in-flight local request is not aborted (it runs on, orphaned, wasting compute). Acks should be best-effort / retried rather than fatal.
Lease vs. request-timeout ordering. Worker --request-timeout-secs defaults to 4h while the server lease COMPUTE_POOL_TASK_TIMEOUT_SECS defaults to 2h. It works in practice because each 30s ack refreshes the lease, but if acks are disrupted on a long task the lease can expire → requeue → double-dispatch. Consider making the lease ≥ request timeout, or document the dependency.
No per-consumer task ownership. Anyone holding the single API token can GET/cancel any task_id. Risk is low (v4 UUIDs are unguessable, per-user scoping is NyxID's job upstream), but worth stating that consumers sharing the service token can read each other's tasks by id.
Tests are syntax-only (node --check). Fine for a reference integration, but there's no functional coverage of the claim/lease/requeue/cancel state machine — the part most likely to regress.
Small doc gaps. openapi.yaml omits the 409 task_terminal cancel response (and, by design, the worker endpoints); servers: says localhost while the default bind is 127.0.0.1.

Items 1 and 2 are the ones I'd most want addressed before this sees real concurrent traffic; the rest are documentation/polish.

AlyciaBHZ · 2026-06-18T09:07:15Z

Addressed the two robustness notes that are most relevant before real concurrent traffic:

Serialized JSON store writes through a process-local save queue and unique per-write tmp file, so concurrent requests in one process no longer race on the same tmp path.
Made worker ack failures retryable. Transient /worker/ack errors no longer immediately fail the task; after --max-ack-failures consecutive failures, the worker aborts the local request and reports failure so compute is not orphaned.

Also tightened the docs/spec items:

Documented the lease/ack relationship and the single consumer token task-visibility boundary.
Updated OpenAPI server URL to 127.0.0.1 and added the 409 task_terminal cancel response.
Re-ran node --check for server.mjs and worker.mjs, plus git diff --check.

Leaving functional state-machine tests and durable storage as follow-up work for when this reference integration grows beyond the smoke-test JSON backend.

Add standalone compute pool service integration

9046812

Clarify compute service-pool boundary

017f12d

chrono-kw reviewed Jun 18, 2026

View reviewed changes

Address compute service robustness notes

0e71650

chrono-kw mentioned this pull request Jun 18, 2026

Add service-pool / load-balancing layer for multiple identical service instances #974

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add standalone compute pool service integration#969

Add standalone compute pool service integration#969
AlyciaBHZ wants to merge 3 commits into
ChronoAIProject:mainfrom
AlyciaBHZ:codex/compute-pool-service

AlyciaBHZ commented Jun 17, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

chrono-kw left a comment

Uh oh!

AlyciaBHZ commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AlyciaBHZ commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Security boundary

Tests

Notes

Uh oh!

github-actions Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Code coverage

Uh oh!

chrono-kw left a comment

Choose a reason for hiding this comment

Review summary: looks good to merge — clean, well-scoped, with a few non-blocking robustness notes

Minor issues

Uh oh!

AlyciaBHZ commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AlyciaBHZ commented Jun 17, 2026 •

edited

Loading

github-actions Bot commented Jun 17, 2026 •

edited

Loading